## CCLE

## Description

This CCLE dataset was generated with standard bulk RNA-seq. The read counts were then normalized to RPKM (Reads Per Kilobase of transcript per Million mapped reads) so that expression values are comparable across genes and samples.

RPKM normalization corrects for two technical biases introduced during sequencing:

1. **Sequencing depth**: samples are sequenced to different depths, so some have more total reads than others. That difference does not reflect biology, so we correct for it.  
2. **Gene length**: longer genes generate more reads simply because they are longer, and shorter genes generate fewer. Dividing by gene length accounts for that.

The RPKM value for gene \(i\) in sample \(j\) is:

$$
\mathrm{RPKM}_{i,j}
=
\frac{x_{i,j}}{\,l_i \cdot \sum_k x_{k,j}\,}
\times 10^6
$$

<!-- blank line above is important -->
- \(x_{i,j}\) is the raw read count for gene \(i\) in sample \(j\)  
- \(l_i\) is the length of gene \(i\) in kilobases (kb)  
- \(\sum_{k} x_{k,j}\) is the total reads in sample \(j\)  
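
As a minimal sketch of the formula above, the function below computes RPKM from a genes-by-samples count matrix. The function name `rpkm` and the toy counts and lengths are made up for illustration; they are not part of the CCLE files, which already ship normalized values.

```python
import numpy as np

def rpkm(counts, lengths_kb):
    """RPKM for a (genes x samples) matrix of raw counts.

    counts     : x_{i,j}, raw read count of gene i in sample j
    lengths_kb : l_i, length of gene i in kilobases
    """
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(lengths_kb, dtype=float)
    total_reads = counts.sum(axis=0)  # sum_k x_{k,j}, one value per sample
    return counts / (lengths_kb[:, None] * total_reads[None, :]) * 1e6

# Toy data (illustrative only): 3 genes x 2 samples.
# Sample 2 is sample 1 sequenced at twice the depth.
toy_counts = np.array([[100, 200],
                       [300, 600],
                       [600, 1200]])
toy_lengths_kb = np.array([1.0, 2.0, 4.0])

toy_rpkm = rpkm(toy_counts, toy_lengths_kb)
print(toy_rpkm)
```

In this toy case both columns come out identical, because doubling the depth doubles every count and the total alike. Genes 2 and 3 also end up with the same value, since the longer gene's extra reads are divided out by its length.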

In [1]:
import pandas as pd
import numpy as np

In [2]:
# ---------mRNA---------
# Description:
# Expression Data with gene as index and cells/samples as columns
data_mrna_seq_rpkm = pd.read_csv('ccle_data/data_mrna_seq_rpkm.txt',
                    sep = '\t',
                    comment = '#')

data_mrna_seq_rpkm.set_index('Hugo_Symbol',inplace=True)

# Average rows that share the same gene symbol
data_mrna_seq_rpkm = data_mrna_seq_rpkm.groupby(data_mrna_seq_rpkm.index).mean()

In [3]:
data_mrna_seq_rpkm.head()

Hugo_Symbol,22RV1_PROSTATE,2313287_STOMACH,253JBV_URINARY_TRACT,253J_URINARY_TRACT,42MGBA_CENTRAL_NERVOUS_SYSTEM,5637_URINARY_TRACT,59M_OVARY,639V_URINARY_TRACT,647V_URINARY_TRACT,697_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE,...,UMUC16_URINARY_TRACT,UMUC4_URINARY_TRACT,UMUC5_URINARY_TRACT,UMUC6_URINARY_TRACT,UMUC7_URINARY_TRACT,UMUC9_URINARY_TRACT,UPCISCC152_UPPER_AERODIGESTIVE_TRACT,UW228_CENTRAL_NERVOUS_SYSTEM,Y79_AUTONOMIC_GANGLIA,YAMATO_SOFT_TISSUE
CRIPTOP1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7SK,0.029785,0.03377,0.047608,0.02094,0.0043,0.017847,0.006893,0.00704,0.008042,0.148955,...,0.083175,0.02683,0.002032,0.024993,0.094365,0.393628,0.035012,0.002465,0.05875,0.067953
A1BG,0.3623,0.00608,0.11517,0.33737,3.06452,0.01339,2.18016,2.24186,0.0954,3.69805,...,0.0127,0.09324,0.01134,0.06009,0.02638,0.01923,0.50558,0.5786,0.71115,0.84912
A1BG-AS1,4.81588,0.18959,0.75151,1.32578,9.15532,0.06459,1.61922,4.48351,0.63141,6.48321,...,0.37136,0.52954,0.18746,0.12101,0.35924,0.80913,2.75065,2.91806,2.06849,2.5964
A1CF,5.65486,2.55482,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,4.12239,0.0,0.0,0.0,0.0,0.0,0.01154,0.00589,0.0


In [4]:
#--------Mut query----------
# Description:
# TP53 status per sample: 'WT' if not mutated, otherwise the protein change(s)
mutations = pd.read_csv('ccle_data/mutations.txt',
                    sep = '\t',
                    comment = '#')
mutations.set_index('SAMPLE_ID',inplace=True)

In [5]:
mutations.head()

SAMPLE_ID,STUDY_ID,TP53
SJRH30_SOFT_TISSUE,ccle_broad_2019,R280S R273C Y205C
FADU_UPPER_AERODIGESTIVE_TRACT,ccle_broad_2019,R248L X225_splice
HH_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE,ccle_broad_2019,X187_splice
SKNMC_BONE,ccle_broad_2019,WT
SNU182_LIVER,ccle_broad_2019,S215I


## Why predict on just the five Variant_Type classes

We focus on the Variant_Type labels SNP, DNP, ONP, INS, and DEL. Only SNP, DNP, DEL, and INS actually occur for TP53 in this dataset, so the classifier ends up with four classes. The reasons are:

1. Biological clarity: these labels describe the fundamental mutation mechanism (single-base or multi-base substitutions, insertions, and deletions), so the model learns clear patterns.  
2. Fewer, larger classes: grouping by Variant_Type gives each class more examples than the detailed Variant_Classification labels, which are very uneven and would leave some classes too small to learn.  
3. Reduced complexity: Variant_Classification depends on gene structure and reading frame (for example, a SNP can be silent or missense depending on the codon), which a model trained only on expression profiles cannot infer without extra annotation.  
4. Modular workflow: once the model tags a variant as INS or DEL, we can apply separate rules or a second model to predict functional impact, keeping each step simpler and more reliable.  


In [6]:
# ------Mut ALL-----------
# Mutation records for all genes; only mutated samples are listed
data_mutations = pd.read_csv('ccle_data/data_mutations.txt',
                    sep = '\t',
                    comment = '#')

# Extract TP53 from all genes
data_mutations = data_mutations[data_mutations['Hugo_Symbol'] == 'TP53']



In [7]:
unique_labels = data_mutations['Variant_Type'].unique()
print(unique_labels)

['SNP' 'DNP' 'DEL' 'INS']


In [8]:
unique_labels = data_mutations['Variant_Classification'].unique()
print(unique_labels)

['Missense_Mutation' 'Nonsense_Mutation' 'Splice_Site' 'Silent'
 'Splice_Region' 'Frame_Shift_Del' 'Frame_Shift_Ins' 'In_Frame_Del'
 'In_Frame_Ins']


In [9]:
# Remove unwanted information
data_mutations = data_mutations[['Tumor_Sample_Barcode', 'Variant_Type']]
data_mutations.set_index('Tumor_Sample_Barcode', inplace=True)
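# A sample can appear several times, once per TP53 mutation record.
# Samples whose records carry different Variant_Type values are ambiguous and should be removed;
# variant_check holds the number of distinct Variant_Type values per sample.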
variant_check = data_mutations.groupby(data_mutations.index)["Variant_Type"].nunique()
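
In the cells shown here, `variant_check` is computed but not used afterwards. A hedged sketch of how such a check could drive the filtering, on a toy stand-in for `data_mutations` (the sample names `S1` to `S3` and the `toy_*` variables are made up for illustration):

```python
import pandas as pd

# Toy stand-in for data_mutations (illustrative sample names)
toy = pd.DataFrame(
    {"Variant_Type": ["SNP", "SNP", "DEL", "SNP", "INS"]},
    index=pd.Index(["S1", "S1", "S2", "S3", "S3"], name="Tumor_Sample_Barcode"),
)

# Number of distinct Variant_Type values per sample
toy_check = toy.groupby(toy.index)["Variant_Type"].nunique()

# Keep only samples with exactly one distinct type ...
unambiguous = toy_check[toy_check == 1].index
toy_clean = toy.loc[toy.index.isin(unambiguous)]

# ... and keep a single row per remaining sample
toy_clean = toy_clean[~toy_clean.index.duplicated(keep="first")]

print(toy_clean)
```

Here `S3` is dropped because it carries both SNP and INS, and the repeated `S1` rows collapse into one. Applying the same filter to the real data would change the class counts and shapes reported further down.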

In [10]:
data_mutations.head()

Tumor_Sample_Barcode,Variant_Type
22RV1_PROSTATE,SNP
A431_SKIN,SNP
A4FUK_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE,SNP
BICR16_UPPER_AERODIGESTIVE_TRACT,SNP
BICR78_UPPER_AERODIGESTIVE_TRACT,SNP


In [11]:
# Count how many mutations are in each Variant_Type
counts = data_mutations['Variant_Type'].value_counts()

# Print them one per line
for variant_type, n in counts.items():
    print(f"{variant_type}: {n}")


SNP: 1042
DEL: 116
INS: 34
DNP: 26


In [12]:
data_t = data_mrna_seq_rpkm.T

data_t.head()

Hugo_Symbol,CRIPTOP1,7SK,A1BG,A1BG-AS1,A1CF,A2M,A2M-AS1,A2ML1,A2ML1-AS1,A2ML1-AS2,...,snoZ13_snr52,snoZ178,snoZ185,snoZ247,snoZ40,snoZ5,snoZ6,snosnR60_Z15,snosnR66,yR211F11.2
22RV1_PROSTATE,0.0,0.029785,0.3623,4.81588,5.65486,1.98954,1.27348,0.0196,0.9107,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,3.34194,0.0,0.0,0.0755
2313287_STOMACH,0.0,0.03377,0.00608,0.18959,2.55482,0.00782,0.22274,0.01738,0.32801,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,3.239685,0.0,0.0,0.0
253JBV_URINARY_TRACT,0.0,0.047608,0.11517,0.75151,0.0,0.05044,0.0,0.00544,0.60473,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
253J_URINARY_TRACT,0.0,0.02094,0.33737,1.32578,0.0,0.02878,0.0,0.00547,0.46686,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
42MGBA_CENTRAL_NERVOUS_SYSTEM,0.0,0.0043,3.06452,9.15532,0.0,0.02477,0.0,0.0,0.099,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.000913,0.0,0.0,0.0


In [13]:
data_t.loc['253JBV_URINARY_TRACT']

Hugo_Symbol
 CRIPTOP1       0.000000
7SK             0.047608
A1BG            0.115170
A1BG-AS1        0.751510
A1CF            0.000000
                  ...   
snoZ5           0.000000
snoZ6           0.000000
snosnR60_Z15    0.000000
snosnR66        0.000000
yR211F11.2      0.000000
Name: 253JBV_URINARY_TRACT, Length: 54353, dtype: float64

In [14]:
code = {"SNP": 0, "DNP": 1, "DEL": 2, "INS": 3}

In [15]:
# Build target vector y
y = []

# prepare lists to collect rows and their sample name
X_rows = []
sample_names = []
c = 0

# iterate over each mutation record
for bc, mut in data_mutations.iterrows():
    # check if this barcode is in data_t’s index
    if bc in data_t.index:
        # grab the full row from data_t and store it
        X_rows.append(data_t.loc[bc].values)
        y.append(code[mut['Variant_Type']])
        sample_names.append(bc)
    else:
        c += 1

print(f"Number of samples discarded: {c}")

# build a new DataFrame X from the collected rows
X = pd.DataFrame(
    X_rows,
    index=sample_names,
    columns=data_t.columns
)

y = np.array(y)

Number of samples discarded: 287


In [16]:
print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")

X shape: (931, 54353)
y shape: (931,)


**Step 2: Train - Test split (80% - 20%)**

In [17]:
from sklearn.model_selection import train_test_split

In [18]:
# 80% train, 20% test, stratify to preserve class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print(f"Train shape:\n\tX_train: {X_train.shape}\n\ty_train: {y_train.shape}")
print(f"Test shape:\n\tX_test: {X_test.shape}\n\ty_test: {y_test.shape}")

Train shape:
	X_train: (744, 54353)
	y_train: (744,)
Test shape:
	X_test: (187, 54353)
	y_test: (187,)


**Step 3: Model selection and Training**

In [19]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

In [20]:
# Instantiate the model
rf = RandomForestClassifier(
    n_estimators=500,      # you can tune this
    max_depth=None,        # full depth; you can limit for speed/regularization
    random_state=42,
    n_jobs=-1              # use all cores
)

In [21]:
# Train
rf.fit(X_train, y_train)

In [33]:
# Predict
y_pred = rf.predict(X_test)

n = y_test.shape[0]
count_mispred = 0
for i in range(n):
    if y_test[i] != y_pred[i]:
        count_mispred += 1

# Misclassification rate on the test set; accuracy is its complement
percentage_mispred = count_mispred / n
print(f'Accuracy: {(1-percentage_mispred)*100:.2f}%')



RuntimeError: Numpy is not available

In [23]:
# PyTorch classifier
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split

# Train-Test split
X_train_np, X_test_np, y_train_np, y_test_np = train_test_split(
    X.values, y, test_size=0.2, random_state=42, stratify=y
)

# Convert to tensors
X_train = torch.tensor(X_train_np, dtype=torch.float32)
y_train = torch.tensor(y_train_np, dtype=torch.long)
X_test  = torch.tensor(X_test_np,  dtype=torch.float32)
y_test  = torch.tensor(y_test_np,  dtype=torch.long)

# Wrap in DataLoaders
train_ds = TensorDataset(X_train, y_train)
test_ds  = TensorDataset(X_test,  y_test)
train_dl = DataLoader(train_ds, batch_size=32, shuffle=True)
test_dl  = DataLoader(test_ds,  batch_size=32)



In [24]:
# Define the model: a small MLP
class MLP(nn.Module):
    def __init__(self, in_features, hidden=128, n_classes=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes)
        )
    def forward(self, x):
        return self.net(x)

In [28]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model  = MLP(in_features=X_train.shape[1]).to(device)
loss_fn = nn.CrossEntropyLoss()
opt     = torch.optim.Adam(model.parameters(), lr=1e-3)

# Training loop
n_epochs = 50
for epoch in range(n_epochs):
    model.train()
    total_loss = 0
    for xb, yb in train_dl:
        xb, yb = xb.to(device), yb.to(device)
        preds  = model(xb)
        loss   = loss_fn(preds, yb)
        opt.zero_grad()
        loss.backward()
        opt.step()
        total_loss += loss.item()
    avg_loss = total_loss / len(train_dl)
    print(f"Epoch {epoch+1}/{n_epochs}, loss: {avg_loss:.4f}")

Epoch 1/50, loss: 91.3287
Epoch 2/50, loss: 24.3034
Epoch 3/50, loss: 6.1718
Epoch 4/50, loss: 2.4856
Epoch 5/50, loss: 0.9281
Epoch 6/50, loss: 1.1944
Epoch 7/50, loss: 0.6243
Epoch 8/50, loss: 0.4053
Epoch 9/50, loss: 0.2899
Epoch 10/50, loss: 0.2742
Epoch 11/50, loss: 0.2282
Epoch 12/50, loss: 0.2178
Epoch 13/50, loss: 0.4175
Epoch 14/50, loss: 0.5181
Epoch 15/50, loss: 0.2881
Epoch 16/50, loss: 0.2617
Epoch 17/50, loss: 0.2401
Epoch 18/50, loss: 0.1896
Epoch 19/50, loss: 0.2083
Epoch 20/50, loss: 0.2814
Epoch 21/50, loss: 0.2229
Epoch 22/50, loss: 0.4478
Epoch 23/50, loss: 0.4564
Epoch 24/50, loss: 0.3685
Epoch 25/50, loss: 0.3428
Epoch 26/50, loss: 0.2773
Epoch 27/50, loss: 0.2629
Epoch 28/50, loss: 0.2645
Epoch 29/50, loss: 0.2107
Epoch 30/50, loss: 0.2252
Epoch 31/50, loss: 0.2297
Epoch 32/50, loss: 0.1862
Epoch 33/50, loss: 0.1858
Epoch 34/50, loss: 0.2133
Epoch 35/50, loss: 0.1842
Epoch 36/50, loss: 0.1547
Epoch 37/50, loss: 0.1485
Epoch 38/50, loss: 0.1992
Epoch 39/50, loss: 

In [32]:
# Evaluation
model.eval()
correct = 0
total   = 0
with torch.no_grad():
    for xb, yb in test_dl:
        xb, yb = xb.to(device), yb.to(device)
        preds  = model(xb).argmax(dim=1)
        correct += (preds == yb).sum().item()
        total   += yb.size(0)
print(f"Test accuracy: {correct/total:.2%}")

Test accuracy: 78.61%


## What is a confusion matrix

A confusion matrix is a table that shows how often a classifier’s predictions match the true labels.  
- Each row is a true class  
- Each column is a predicted class  
- The cell at row i and column j is the count of samples whose true label is class i but the model predicted class j  
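
To make the row/column convention concrete, here is a small sketch that builds the matrix by hand with NumPy. The helper `confusion_counts` and the toy labels are illustrative; they follow the same rows-are-true, columns-are-predicted layout described above.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """cm[i, j] = number of samples with true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy labels (illustrative only), three classes encoded as 0, 1, 2
toy_true = [0, 0, 0, 1, 2, 2]
toy_pred = [0, 0, 2, 0, 2, 0]

toy_cm = confusion_counts(toy_true, toy_pred, n_classes=3)
print(toy_cm)
```

The diagonal holds the correct predictions, and each row sums to the number of true samples in that class.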

## How we use it in this context

We built a model to predict one of four mutation types: SNP, DNP, DEL, INS.  
After running the model on the test set, we collect the true labels and the predicted labels, then build the confusion matrix. This tells us which mutation types the model gets right and which it confuses.

In [27]:
from sklearn.metrics import confusion_matrix
import pandas as pd

# collect all true labels and predictions
all_labels = []
all_preds  = []

model.eval()
with torch.no_grad():
    for xb, yb in test_dl:
        xb, yb = xb.to(device), yb.to(device)
        preds = model(xb).argmax(dim=1)
        all_preds.extend(preds.cpu().tolist())
        all_labels.extend(yb.cpu().tolist())

# compute confusion matrix
cm = confusion_matrix(all_labels, all_preds)

# optional: pretty print with class names
class_names = ['SNP', 'DNP', 'DEL', 'INS']
cm_df = pd.DataFrame(cm, index=class_names, columns=class_names)
print(cm_df)

     SNP  DNP  DEL  INS
SNP  119    0   39    2
DNP    2    0    1    0
DEL   14    0    5    0
INS    5    0    0    0


## The obtained confusion matrix

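Here is the table we got (rows = true label, columns = predicted label). These counts come from a different training run than the printed output above: the row totals agree (160, 3, 19, 5), but the per-cell counts differ, most likely because the training cells set no PyTorch random seed.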

| True \ Predicted | SNP | DNP | DEL | INS |
|------------------|-----|-----|-----|-----|
| **SNP**          | 148 |   0 |   8 |   4 |
| **DNP**          |   3 |   0 |   0 |   0 |
| **DEL**          |  16 |   0 |   3 |   0 |
| **INS**          |   5 |   0 |   0 |   0 |

### What this tells us

- **SNP**: 148 correct, 12 wrong (8 called DEL, 4 called INS)  
- **DNP**: 0 correct, all 3 called SNP  
- **DEL**: 3 correct, 16 called SNP  
- **INS**: 0 correct, all 5 called SNP  

The model is strong at spotting SNPs but it almost always labels DNP, DEL and INS as SNP.
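
The per-class reading above can be reproduced from the table with a few lines of NumPy. This is a sketch that hard-codes the counts from the table; the variable names are illustrative.

```python
import numpy as np

class_names = ["SNP", "DNP", "DEL", "INS"]

# Counts copied from the table above (rows = true, columns = predicted)
cm_table = np.array([
    [148, 0, 8, 4],
    [3,   0, 0, 0],
    [16,  0, 3, 0],
    [5,   0, 0, 0],
])

support = cm_table.sum(axis=1)                   # true samples per class
recall = np.diag(cm_table) / support             # fraction of each class recovered
accuracy = np.trace(cm_table) / cm_table.sum()   # overall fraction correct
balanced_accuracy = recall.mean()                # unweighted mean of per-class recall

for name, n, r in zip(class_names, support, recall):
    print(f"{name}: recall {r:.3f} over {n} samples")
print(f"accuracy: {accuracy:.3f}, balanced accuracy: {balanced_accuracy:.3f}")
```

Overall accuracy is about 81%, while the mean per-class recall is only about 27%: the headline accuracy is carried almost entirely by the SNP class.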

## Class distribution in the full dataset

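| Class | Records |
|-------|--------:|
| SNP   |   1042  |
| DEL   |    116  |
| INS   |     34  |
| DNP   |     26  |

These are TP53 mutation records before matching to the expression data. Of the 1218 records, 931 have a matching expression profile and enter the model.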

### Reasoning

Because SNP makes up about 86% of the records (1042 of 1218), the model sees many more SNP examples during training. This imbalance makes it easier to learn SNP patterns and harder to learn the rarer classes (DEL, INS, DNP), so the model is biased toward predicting SNP. To counter this, we can collect more samples of the rare classes, resample the training set, or weight the loss by class.  
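
As one possible way to set class weights, the sketch below uses the common "balanced" heuristic, `n_samples / (n_classes * count)`, which is also what scikit-learn applies for `class_weight="balanced"`. The helper name `balanced_class_weights` is made up, and the labels are rebuilt from the class-distribution table purely for illustration; in practice the weights would be computed from `y_train`.

```python
import numpy as np

def balanced_class_weights(y, n_classes):
    """Weight each class inversely to its frequency.

    Assumes every class appears at least once in y
    (a zero count would cause a division by zero).
    """
    counts = np.bincount(y, minlength=n_classes)
    return len(y) / (n_classes * counts)

# Illustrative labels rebuilt from the table above,
# encoded as in the notebook: SNP=0, DNP=1, DEL=2, INS=3
toy_y = np.repeat([0, 1, 2, 3], [1042, 26, 116, 34])

weights = balanced_class_weights(toy_y, n_classes=4)
for name, w in zip(["SNP", "DNP", "DEL", "INS"], weights):
    print(f"{name}: {w:.3f}")
```

A weight vector like this could then be passed as a float tensor to `nn.CrossEntropyLoss(weight=...)` for the MLP, or replaced by `class_weight="balanced"` in `RandomForestClassifier`. Whether this actually improves recall on the rare classes here is untested, since the smallest classes have only a handful of test samples.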
