# 🧬 Stanford RNA 3D Folding Competition: Structure Prediction Masterclass 🧠

<a href="https://www.kaggle.com/competitions/stanford-rna-3d-folding" target="_blank">
  <img src="https://img.shields.io/badge/Kaggle-Competition-blue?style=for-the-badge&logo=kaggle" alt="Kaggle Competition">
</a>

🌟 **Welcome to the Ultimate RNA 3D Structure Prediction Notebook!** 🌟

*Unlock the secrets of RNA folding through graph deep learning. This comprehensive guide combines biological insights with advanced AI techniques to tackle one of molecular biology's most challenging problems.*

---

## 🚀 Notebook Roadmap: From Sequence to Structure

<div style="padding: 15px; border: 2px solid #2ecc71; border-radius: 10px; margin: 20px 0;">
🔍 <strong>Section 1: Exploratory Data Analysis (EDA)</strong><br>  
📌 Sequence length and nucleotide composition<br>  
📌 3D coordinate distribution analysis<br>  
📌 Temporal cutoff trends
</div>

<div style="padding: 15px; border: 2px solid #3498db; border-radius: 10px; margin: 20px 0;">
🔧 <strong>Section 2: Data Preprocessing Pipeline</strong><br>  
🧬 RNA sequence encoding<br>  
🎯 Grouping per-residue coordinate labels by target<br>  
🧩 Zero-padding to the longest training sequence
</div>

<div style="padding: 15px; border: 2px solid #9b59b6; border-radius: 10px; margin: 20px 0;">
🤖 <strong>Section 3: Graph Neural Network Architecture</strong>  
</div>

<div style="padding: 15px; border: 2px solid #e67e22; border-radius: 10px; margin: 20px 0;">
⚙️ <strong>Section 4: Model Training Strategies</strong><br>  
🎯 Loss: MSE (CNN) and masked MSE (GCN)<br>  
⏱️ Early stopping on validation loss with best-weight restoration  
</div>


## 1. Data Loading and Exploration

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# TensorFlow/Keras for the deep learning models
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, Conv1D, BatchNormalization, Dropout, Dense
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping
# Spektral graph convolution layer for the GCN model
from spektral.layers import GCNConv

# Set random seed for reproducibility
np.random.seed(42)
tf.random.set_seed(42)


We load the CSV files provided in the competition:
- `train_sequences.csv`
- `train_labels.csv`
- `validation_sequences.csv` & `validation_labels.csv`
- `test_sequences.csv`
- `sample_submission.csv`



In [None]:
TRAIN_SEQ_PATH = '/kaggle/input/stanford-rna-3d-folding/train_sequences.csv'
TRAIN_LABELS_PATH = '/kaggle/input/stanford-rna-3d-folding/train_labels.csv'
VALID_SEQ_PATH = '/kaggle/input/stanford-rna-3d-folding/validation_sequences.csv'
VALID_LABELS_PATH = '/kaggle/input/stanford-rna-3d-folding/validation_labels.csv'
TEST_SEQ_PATH  = '/kaggle/input/stanford-rna-3d-folding/test_sequences.csv'
SAMPLE_SUB_PATH = '/kaggle/input/stanford-rna-3d-folding/sample_submission.csv'

train_sequences = pd.read_csv(TRAIN_SEQ_PATH)
train_labels = pd.read_csv(TRAIN_LABELS_PATH)
valid_sequences = pd.read_csv(VALID_SEQ_PATH)
valid_labels = pd.read_csv(VALID_LABELS_PATH)
test_sequences = pd.read_csv(TEST_SEQ_PATH)
sample_submission = pd.read_csv(SAMPLE_SUB_PATH)

train_labels.fillna(0, inplace=True)
valid_labels.fillna(0, inplace=True)

print("Train Sequences Shape:", train_sequences.shape)
print("Train Labels Shape:", train_labels.shape)
print("Validation Sequences Shape:", valid_sequences.shape)
print("Validation Labels Shape:", valid_labels.shape)
print("Test Sequences Shape:", test_sequences.shape)

print("\nTrain Sequences Head:")
print(train_sequences.head())
print("\nTrain Labels Head:")
print(train_labels.head())

**Histogram of sequence lengths across the train, validation, and test sets**

The models below pad every sequence to the longest training sequence (4298 residues). The length distribution shows how much of each padded input is padding, which can inform the padding strategy or the choice of an architecture that handles variable lengths.

In [None]:
plt.figure(figsize=(10, 6))
plt.hist(train_sequences['sequence'].str.len(), bins=50, alpha=0.7, label='Train')
plt.hist(valid_sequences['sequence'].str.len(), bins=50, alpha=0.7, label='Validation')
plt.hist(test_sequences['sequence'].str.len(), bins=50, alpha=0.7, label='Test')
plt.xlabel('Sequence Length')
plt.ylabel('Frequency')
plt.title('Distribution of RNA Sequence Lengths')
plt.legend()
plt.show()
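The histogram can be complemented with a few numbers. The sketch below estimates what share of a padded batch would be padding if every sequence were padded to a common length. `padding_summary` and the toy lengths are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def padding_summary(lengths, pad_to=None):
    """Summarise sequence lengths and the share of padded positions."""
    lengths = np.asarray(lengths, dtype=np.int64)
    # By default, pad to the longest sequence, as the notebook does
    pad_to = int(lengths.max()) if pad_to is None else int(pad_to)
    total_slots = pad_to * len(lengths)
    # Sequences longer than pad_to would be truncated, so cap them
    real_slots = int(np.minimum(lengths, pad_to).sum())
    return {
        "median": float(np.median(lengths)),
        "p95": float(np.percentile(lengths, 95)),
        "max": int(lengths.max()),
        "padding_fraction": 1.0 - real_slots / total_slots,
    }

# Toy lengths, made up for illustration only
toy_lengths = [2, 4, 4, 10]
toy_summary = padding_summary(toy_lengths)
print(toy_summary)
```

On the real data this could be called as `padding_summary(train_sequences['sequence'].str.len())`. A high padding fraction suggests that most of the compute is spent on padded positions.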

**Calculating and visualizing the proportion of each nucleotide**

The frequency of A, C, G, and U might influence folding patterns or model bias.


In [None]:
from collections import Counter
train_nucleotides = ''.join(train_sequences['sequence'])
counts = Counter(train_nucleotides)
plt.bar(counts.keys(), counts.values())
plt.title('Nucleotide Composition in Training Sequences')
plt.xlabel('Nucleotide')
plt.ylabel('Count')
plt.show()
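The bar chart above shows raw counts. To compare composition across datasets of different sizes, the counts can be turned into proportions. A minimal sketch, where `nucleotide_fractions` and the toy sequences are hypothetical names introduced for illustration:

```python
from collections import Counter

def nucleotide_fractions(sequences):
    """Return the fraction of each character across all sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    if total == 0:
        return {}
    # Sort by character so the result is stable across runs
    return {base: n / total for base, n in sorted(counts.items())}

# Toy sequences, made up for illustration only
toy_sequences = ["GGAU", "CCAU"]
toy_fractions = nucleotide_fractions(toy_sequences)
print(toy_fractions)
```

On the real data, `nucleotide_fractions(train_sequences['sequence'])` would give the training-set proportions.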

**Histograms of the x, y, and z coordinates**

Understanding the range and spread of x, y, z coordinates can help assess data scale and whether normalization is needed.

In [None]:
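# Use the label table directly: the per-target dictionary is only built later, in Section 2.
# Missing coordinates were filled with 0 above, so expect a spike at 0 in each histogram.
coords = train_labels[['x_1', 'y_1', 'z_1']].to_numpy(dtype=np.float32)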
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for i, coord in enumerate(['x_1', 'y_1', 'z_1']):
    axes[i].hist(coords[:, i], bins=50)
    axes[i].set_title(f'Distribution of {coord}')
plt.tight_layout()
plt.show()
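If the histograms show that structures sit at very different offsets from the origin, one simple normalization is to center each structure on its own centroid. A rigid translation does not change the shape of a structure. The sketch below is one possible approach, not something the notebook applies. `center_coordinates` is a hypothetical helper, and it assumes missing coordinates are still NaN rather than zero-filled.

```python
import numpy as np

def center_coordinates(coords):
    """Subtract the centroid of the resolved residues from one structure.

    coords: array of shape (seq_len, 3); unresolved residues are NaN rows.
    """
    coords = np.asarray(coords, dtype=np.float64)
    # A residue counts as resolved only if x, y and z are all present
    resolved = ~np.isnan(coords).any(axis=1)
    if not resolved.any():
        return coords.copy()
    centroid = coords[resolved].mean(axis=0)
    # NaN rows stay NaN after the subtraction
    return coords - centroid

# Toy structure, made up for illustration only
toy_coords = np.array([[0.0, 0.0, 0.0],
                       [2.0, 2.0, 2.0],
                       [np.nan, np.nan, np.nan]])
toy_centered = center_coordinates(toy_coords)
print(toy_centered)
```

Centering before zero-filling avoids letting the filled zeros shift the centroid.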

**Convert `temporal_cutoff` to datetime and explore trends over time**

In [None]:
train_sequences['temporal_cutoff'] = pd.to_datetime(train_sequences['temporal_cutoff'])
plt.figure(figsize=(10, 6))
train_sequences['temporal_cutoff'].dt.year.value_counts().sort_index().plot(kind='bar')
plt.title('Sequences by Year of Temporal Cutoff')
plt.xlabel('Year')
plt.ylabel('Number of Sequences')
plt.show()

## 2. Data Preprocessing

### Sequence Encoding

We map each nucleotide to an integer:
- A: 1, C: 2, G: 3, U: 4  
Unknown characters are mapped to 0.

In [None]:
nucleotide_map = {'A': 1, 'C': 2, 'G': 3, 'U': 4}

def encode_sequence(seq):
    """Encodes an RNA sequence into a list of integers based on nucleotide_map."""
    return [nucleotide_map.get(ch, 0) for ch in seq]

# Apply encoding to all sequence files
train_sequences['encoded'] = train_sequences['sequence'].apply(encode_sequence)
valid_sequences['encoded'] = valid_sequences['sequence'].apply(encode_sequence)
test_sequences['encoded'] = test_sequences['sequence'].apply(encode_sequence)
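Because unknown characters share code 0 with the padding value used later, it can help to check how often they occur before training. A minimal sketch, where `unknown_characters` and the toy sequences are illustrative and not part of the original notebook:

```python
from collections import Counter

# The four nucleotides that the notebook's nucleotide_map covers
KNOWN_NUCLEOTIDES = set("ACGU")

def unknown_characters(sequences):
    """Count characters that the encoder would map to 0."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq if ch not in KNOWN_NUCLEOTIDES)
    return dict(counts)

# Toy sequences, made up for illustration only
toy_sequences = ["GGAU", "ACN-N"]
toy_unknown = unknown_characters(toy_sequences)
print(toy_unknown)
```

If the result is non-empty on the real data, a dedicated code for unknown residues (distinct from padding) may be worth considering.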

**Processing Label Data**

Each row in the labels CSV is for one residue, with an `ID` formatted as `target_id_resid`.
We group rows by `target_id` and sort by residue number.
Here, we use the first structure (x_1, y_1, z_1) as our target coordinates.

In [None]:
def process_labels(labels_df):
    """
    Processes a labels DataFrame by grouping rows by target_id.
    Returns a dictionary mapping target_id to an array of coordinates (seq_len, 3).
    """
    label_dict = {}
    for idx, row in labels_df.iterrows():
        # Split ID into target_id and residue number (assumes format "targetid_resid")
        parts = row['ID'].split('_')
        target_id = "_".join(parts[:-1])
        resid = int(parts[-1])
        # Extract the coordinates; they should be numeric (missing values already set to 0)
        coord = np.array([row['x_1'], row['y_1'], row['z_1']], dtype=np.float32)
        if target_id not in label_dict:
            label_dict[target_id] = []
        label_dict[target_id].append((resid, coord))
    
    # Sort residues by resid and stack coordinates
    for key in label_dict:
        sorted_coords = sorted(label_dict[key], key=lambda x: x[0])
        coords = np.stack([c for r, c in sorted_coords])
        label_dict[key] = coords
    return label_dict

# Process training and validation labels
train_labels_dict = process_labels(train_labels)
valid_labels_dict = process_labels(valid_labels)
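Row-by-row iteration with `iterrows` can be slow on large label tables. The same grouping can be sketched with vectorized pandas operations. `process_labels_vectorized` is a hypothetical alternative, shown under the assumption that every `ID` ends in `_<integer residue number>`. It is meant as a sketch rather than a drop-in replacement.

```python
import numpy as np
import pandas as pd

COORD_COLS = ["x_1", "y_1", "z_1"]

def process_labels_vectorized(labels_df):
    """Map each target_id to a (seq_len, 3) float32 array sorted by residue."""
    # Split on the last underscore only, since target ids may contain underscores
    parts = labels_df["ID"].str.rsplit("_", n=1, expand=True)
    work = labels_df[COORD_COLS].copy()
    work["target_id"] = parts[0]
    work["resid"] = parts[1].astype(int)
    work = work.sort_values(["target_id", "resid"])
    return {
        tid: grp[COORD_COLS].to_numpy(dtype=np.float32)
        for tid, grp in work.groupby("target_id", sort=False)
    }

# Toy label table, made up for illustration only
toy_labels = pd.DataFrame({
    "ID": ["R1_A_2", "R1_A_1", "X_1"],
    "x_1": [2.0, 1.0, 9.0],
    "y_1": [20.0, 10.0, 90.0],
    "z_1": [200.0, 100.0, 900.0],
})
toy_grouped = process_labels_vectorized(toy_labels)
print({k: v.shape for k, v in toy_grouped.items()})
```

The residue rows for `R1_A` come back in residue order even though the input rows were shuffled.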

In [None]:
def create_graph_dataset(sequences_df, labels_dict):
    X_list, A_list, y_list, target_ids = [], [], [], []
    for idx, row in sequences_df.iterrows():
        tid = row['target_id']
        if tid in labels_dict:
            seq = row['encoded']
            seq_len = len(seq)
            # Node features (n_nodes, n_features)
            X = np.array(seq, dtype=np.float32).reshape(-1, 1)  # Shape: [seq_len, 1]
            # Adjacency matrix for backbone (n_nodes, n_nodes)
            A = np.zeros((seq_len, seq_len), dtype=np.float32)
            for i in range(seq_len - 1):
                A[i, i + 1] = 1
                A[i + 1, i] = 1  # Undirected graph
            # Labels
            y = labels_dict[tid]  # Shape: [seq_len, 3]
            X_list.append(X)
            A_list.append(A)
            y_list.append(y)
            target_ids.append(tid)
    return X_list, A_list, y_list, target_ids

X_train, A_train, y_train, train_ids = create_graph_dataset(train_sequences, train_labels_dict)
X_valid, A_valid, y_valid, valid_ids = create_graph_dataset(valid_sequences, valid_labels_dict)

max_len = max(len(seq) for seq in X_train)
print("Maximum sequence length (train):", max_len)

def pad_graph(X, A, y, max_len):
    seq_len = X.shape[0]
    if seq_len < max_len:
        # Pad node features
        X_pad = np.pad(X, ((0, max_len - seq_len), (0, 0)), mode='constant', constant_values=0)
        # Pad adjacency matrix
        A_pad = np.pad(A, ((0, max_len - seq_len), (0, max_len - seq_len)), mode='constant', constant_values=0)
        # Pad coordinates
        y_pad = np.pad(y, ((0, max_len - seq_len), (0, 0)), mode='constant', constant_values=0)
    else:
        X_pad, A_pad, y_pad = X, A, y
    return X_pad, A_pad, y_pad

X_train_pad = []
A_train_pad = []
y_train_pad = []
for x, a, y in zip(X_train, A_train, y_train):
    x_p, a_p, y_p = pad_graph(x, a, y, max_len)
    X_train_pad.append(x_p)
    A_train_pad.append(a_p)
    y_train_pad.append(y_p)

X_valid_pad = []
A_valid_pad = []
y_valid_pad = []
for x, a, y in zip(X_valid, A_valid, y_valid):
    x_p, a_p, y_p = pad_graph(x, a, y, max_len)
    X_valid_pad.append(x_p)
    A_valid_pad.append(a_p)
    y_valid_pad.append(y_p)

X_train_pad = np.array(X_train_pad)  # Shape: [n_samples, max_len, 1]
A_train_pad = np.array(A_train_pad)  # Shape: [n_samples, max_len, max_len]
y_train_pad = np.array(y_train_pad)  # Shape: [n_samples, max_len, 3]
X_valid_pad = np.array(X_valid_pad)
A_valid_pad = np.array(A_valid_pad)
y_valid_pad = np.array(y_valid_pad)

print("X_train_pad shape:", X_train_pad.shape)
print("A_train_pad shape:", A_train_pad.shape)
print("y_train_pad shape:", y_train_pad.shape)
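Dense adjacency matrices grow quadratically with the padded length, so it is worth estimating their memory footprint before building `A_train_pad`. The arithmetic below uses the 4298-residue maximum and the `float32` dtype from the cells above. The function names are illustrative.

```python
def dense_adjacency_bytes(n_samples, max_len, bytes_per_value=4):
    """Bytes needed for n_samples dense (max_len x max_len) matrices."""
    return n_samples * max_len * max_len * bytes_per_value

def backbone_edge_count(seq_len):
    """Number of nonzero entries in an undirected backbone adjacency."""
    return 2 * max(seq_len - 1, 0)

per_graph = dense_adjacency_bytes(n_samples=1, max_len=4298)
print(f"{per_graph / 1e6:.1f} MB per padded graph")
print(f"{backbone_edge_count(4298)} nonzero entries in the longest backbone")
```

Only a tiny fraction of each dense matrix is nonzero, so a sparse or edge-list representation may be preferable when many long sequences are involved.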

**Creating Datasets and Padding**

We match each target sequence with its corresponding coordinate labels.
Then we pad sequences and coordinate arrays to a uniform length.

Padded positions in coordinates are set to 0.

In [None]:
def create_dataset(sequences_df, labels_dict):
    """
    Creates a dataset from a sequences DataFrame and a labels dictionary.
    Returns:
        X: list of encoded sequences,
        y: list of coordinate arrays,
        target_ids: list of target ids.
    """
    X, y, target_ids = [], [], []
    for idx, row in sequences_df.iterrows():
        tid = row['target_id']
        if tid in labels_dict:
            X.append(row['encoded'])
            y.append(labels_dict[tid])
            target_ids.append(tid)
    return X, y, target_ids

# Create training and validation datasets
X_train, y_train, train_ids = create_dataset(train_sequences, train_labels_dict)
X_valid, y_valid, valid_ids = create_dataset(valid_sequences, valid_labels_dict)

# Determine maximum sequence length from training set
max_len = max(len(seq) for seq in X_train)
print("Maximum sequence length (train):", max_len)

# Pad the sequences (padding value = 0)
X_train_pad = pad_sequences(X_train, maxlen=max_len, padding='post', value=0)
X_valid_pad = pad_sequences(X_valid, maxlen=max_len, padding='post', value=0)

# Function to pad coordinate arrays
def pad_coordinates(coord_array, max_len):
    L = coord_array.shape[0]
    if L < max_len:
        pad_width = ((0, max_len - L), (0, 0))
        return np.pad(coord_array, pad_width, mode='constant', constant_values=0)
    else:
        return coord_array

# Pad coordinate arrays
y_train_pad = np.array([pad_coordinates(arr, max_len) for arr in y_train])
y_valid_pad = np.array([pad_coordinates(arr, max_len) for arr in y_valid])

# Check for any NaN values in the targets
print("Any NaN in y_train_pad?", np.isnan(y_train_pad).any())
print("Any NaN in y_valid_pad?", np.isnan(y_valid_pad).any())

print("X_train_pad shape:", X_train_pad.shape)
print("y_train_pad shape:", y_train_pad.shape)
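Since padded coordinates are set to 0, they cannot be told apart from a zero-filled missing coordinate by value alone. An explicit mask built from the true sequence lengths keeps that information. A minimal sketch, where `length_mask` is a hypothetical helper:

```python
import numpy as np

def length_mask(lengths, max_len):
    """Boolean array of shape (n_samples, max_len): True for real residues."""
    lengths = np.asarray(lengths).reshape(-1, 1)
    positions = np.arange(max_len).reshape(1, -1)
    # Broadcasting compares every position index with every sequence length
    return positions < lengths

# Toy lengths, made up for illustration only
toy_mask = length_mask([2, 4], max_len=4)
print(toy_mask)
```

With the notebook's variables this could be called as `length_mask([len(s) for s in X_train], max_len)`.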

## 3.1 Basic CNN Model Training

In this section, we build a basic CNN-based model.
The model uses:
- An Embedding layer  
- Two Conv1D blocks (with BatchNormalization and Dropout)  
- A final Conv1D layer (kernel size 1) to output 3 coordinates per residue

In [None]:
# Define hyperparameters for the CNN model
vocab_size = max(nucleotide_map.values()) + 1  # +1 for padding token 0
embedding_dim = 16
num_filters = 64
kernel_size = 3
drop_rate = 0.2

# Build the CNN model
input_seq_cnn = Input(shape=(max_len,), name='input_seq')
x_cnn = Embedding(input_dim=vocab_size, output_dim=embedding_dim, mask_zero=True, name='embedding')(input_seq_cnn)

# First convolutional block
x_cnn = Conv1D(filters=num_filters, kernel_size=kernel_size, padding='same', activation='relu', name='conv1')(x_cnn)
x_cnn = BatchNormalization(name='bn1')(x_cnn)
x_cnn = Dropout(drop_rate, name='drop1')(x_cnn)

# Second convolutional block
x_cnn = Conv1D(filters=num_filters, kernel_size=kernel_size, padding='same', activation='relu', name='conv2')(x_cnn)
x_cnn = BatchNormalization(name='bn2')(x_cnn)
x_cnn = Dropout(drop_rate, name='drop2')(x_cnn)

# Final convolution to output 3 coordinates per residue (x, y, z)
output_coords_cnn = Conv1D(filters=3, kernel_size=1, padding='same', activation='linear', name='predicted_coords')(x_cnn)

cnn_model = Model(inputs=input_seq_cnn, outputs=output_coords_cnn)
cnn_model.compile(optimizer='adam', loss='mse')

cnn_model.summary()
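One consequence of this architecture is a small receptive field: each predicted coordinate only depends on a short window of neighbouring residues. For stacked stride-1, undilated convolutions, each layer adds `kernel_size - 1` positions. The helper below is an illustrative calculation, not part of the model code.

```python
def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1, dilation-1 1D convolutions."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field

# Kernel sizes of conv1, conv2 and the final 1x1 convolution
cnn_field = receptive_field([3, 3, 1])
print(cnn_field)  # 5
```

A window of five residues cannot capture long-range base pairing, which is one motivation for trying deeper, dilated, or graph-based models.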

## 3.2 Model Training

We train the CNN model using early stopping to monitor the validation loss.
Because missing coordinates were filled with 0 during loading, the targets contain no NaN values, so training should not produce NaN losses.

In [None]:
early_stop_cnn = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

history_cnn = cnn_model.fit(X_train_pad, y_train_pad,
                            validation_data=(X_valid_pad, y_valid_pad),
                            epochs=50,
                            batch_size=16,
                            callbacks=[early_stop_cnn],
                            verbose=1)

# Plot training and validation loss
plt.figure(figsize=(8, 5))
plt.plot(history_cnn.history['loss'], label='Train Loss (CNN)')
plt.plot(history_cnn.history['val_loss'], label='Val Loss (CNN)')
plt.xlabel("Epoch")
plt.ylabel("MSE Loss")
plt.title("CNN Model Training vs. Validation Loss")
plt.legend()
plt.show()
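The CNN above is compiled with plain MSE, which also averages over padded positions, so the reported loss depends on how much padding a batch contains. The NumPy sketch below illustrates the difference between plain and masked MSE on a toy example. The function names and data are assumptions made for illustration.

```python
import numpy as np

def plain_mse(y_true, y_pred):
    """Mean squared error over every position, padded or not."""
    return float(np.mean((y_true - y_pred) ** 2))

def masked_mse_np(y_true, y_pred, mask):
    """Mean squared error over real residues only.

    mask: boolean array of shape (n_samples, max_len), True for real residues.
    """
    per_residue = np.mean((y_true - y_pred) ** 2, axis=-1)
    weights = mask.astype(per_residue.dtype)
    # Avoid division by zero if the mask is empty
    return float((per_residue * weights).sum() / max(weights.sum(), 1.0))

# One toy sample: 2 real residues followed by 2 padded positions
toy_true = np.zeros((1, 4, 3))
toy_true[0, :2] = 3.0
toy_pred = np.zeros((1, 4, 3))
toy_mask = np.array([[True, True, False, False]])

toy_plain = plain_mse(toy_true, toy_pred)
toy_masked = masked_mse_np(toy_true, toy_pred, toy_mask)
print(toy_plain, toy_masked)
```

Here the padded half of the sample halves the plain MSE relative to the masked value, even though the error on the real residues is unchanged.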

## 4. GCN Model Building and Training

In [None]:
# Hyperparameters
vocab_size = max(nucleotide_map.values()) + 1  # 5 (0 for padding)
embedding_dim = 16
gcn_units = 64
drop_rate = 0.2

# Build GCN model
input_nodes = Input(shape=(max_len, 1), name='input_nodes')  # Node features
input_adj = Input(shape=(max_len, max_len), name='input_adj')  # Adjacency matrix

# Embedding layer: integer nucleotide codes -> dense vectors
x = Embedding(input_dim=vocab_size, output_dim=embedding_dim, mask_zero=True)(input_nodes)  # [n_samples, max_len, 1, embedding_dim]
x = tf.squeeze(x, axis=2)  # Drop the singleton feature axis -> [n_samples, max_len, embedding_dim]

# GCN layers
x = GCNConv(gcn_units, activation='relu')([x, input_adj])
x = Dropout(drop_rate)(x)
x = GCNConv(gcn_units, activation='relu')([x, input_adj])
x = Dropout(drop_rate)(x)

# Output layer: predict 3 coordinates per node
output_coords = Dense(3, activation='linear', name='predicted_coords')(x)  # [n_samples, max_len, 3]

gcn_model = Model(inputs=[input_nodes, input_adj], outputs=output_coords)

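# Custom masked loss: ignore positions whose target is exactly (0, 0, 0).
# This covers padding and residues whose missing coordinates were filled with 0.
def masked_mse(y_true, y_pred):
    mask = tf.cast(tf.reduce_any(tf.not_equal(y_true, 0), axis=-1), tf.float32)
    # Squared error averaged over x, y, z for each residue
    per_residue = tf.reduce_mean(tf.square(y_true - y_pred), axis=-1)
    # Guard against an all-padding batch to avoid division by zero
    return tf.reduce_sum(per_residue * mask) / tf.maximum(tf.reduce_sum(mask), 1.0)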

gcn_model.compile(optimizer='adam', loss=masked_mse)
gcn_model.summary()
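The model above receives the raw 0/1 backbone adjacency. Graph convolutional networks in the Kipf and Welling formulation are usually fed a symmetrically normalized adjacency with self-loops, and Spektral's documentation describes a preprocessing step for `GCNConv` to that effect, so it is worth checking against the installed version. The NumPy sketch below shows the normalization itself. `normalize_adjacency` is a hypothetical helper, not a Spektral function.

```python
import numpy as np

def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense adjacency matrix A."""
    A = np.asarray(A, dtype=np.float64)
    # Add self-loops so each node also keeps its own features
    A_hat = A + np.eye(A.shape[0])
    # Degrees are at least 1 thanks to the self-loops, so no division by zero
    deg = A_hat.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# Backbone of a toy 3-residue chain
toy_chain = np.array([[0, 1, 0],
                      [1, 0, 1],
                      [0, 1, 0]])
toy_normalized = normalize_adjacency(toy_chain)
print(np.round(toy_normalized, 3))
```

Applying this to each graph before padding keeps padded rows and columns at zero.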

In [None]:
early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

history = gcn_model.fit(
    [X_train_pad, A_train_pad], y_train_pad,
    validation_data=([X_valid_pad, A_valid_pad], y_valid_pad),
    epochs=100,
    batch_size=32,
    callbacks=[early_stopping],
    verbose=1
)

# Plot training history
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()

## 5. Generating Predictions and Submission File

For each test sequence, we predict the 3D coordinates using our trained CNN model.

The submission requires 5 sets of coordinates per target. In this baseline, we replicate the same predicted structure 5 times.

In [None]:
# Prepare test data: pad sequences to same length as training
X_test = test_sequences['encoded'].tolist()
X_test_pad = pad_sequences(X_test, maxlen=max_len, padding='post', value=0)

# Predict coordinates using the trained CNN model
predictions = cnn_model.predict(X_test_pad)

# Build submission rows. Each row corresponds to a residue from a test target.
submission_rows = []
for idx, row in test_sequences.iterrows():
    target_id = row['target_id']
    # Get predicted coordinates (shape: [max_len, 3])
    pred_coords = predictions[idx]
    # Determine actual sequence length
    seq_length = len(row['encoded'])
    pred_coords = pred_coords[:seq_length, :]  # only actual residues
    
    # For each residue, create a row in the submission file
    for i in range(seq_length):
        coords = pred_coords[i, :]
        # Replicate the same prediction 5 times for submission format
        submission_rows.append({
            'ID': f"{target_id}_{i+1}",
            'resname': row['sequence'][i],
            'resid': i+1,
            **{f"x_{j+1}": coords[0] for j in range(5)},
            **{f"y_{j+1}": coords[1] for j in range(5)},
            **{f"z_{j+1}": coords[2] for j in range(5)}
        })

submission_df = pd.DataFrame(submission_rows)
print("Submission DataFrame shape:", submission_df.shape)
print(submission_df.head(10))

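**Generating predictions with the GCN model**

This cell overwrites `predictions` and `submission_df` from the CNN cell above, so the file saved in Section 6 contains the GCN predictions if both cells are run in order.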

In [None]:
# Prepare test data
X_test = test_sequences['encoded'].tolist()
A_test = [np.zeros((len(seq), len(seq))) for seq in X_test]
for i, seq in enumerate(X_test):
    for j in range(len(seq) - 1):
        A_test[i][j, j + 1] = 1
        A_test[i][j + 1, j] = 1

X_test_pad = []
A_test_pad = []
for x, a in zip(X_test, A_test):
    x_p = np.pad(x, (0, max_len - len(x)), mode='constant', constant_values=0).reshape(-1, 1)
    a_p = np.pad(a, ((0, max_len - len(a)), (0, max_len - len(a))), mode='constant', constant_values=0)
    X_test_pad.append(x_p)
    A_test_pad.append(a_p)

X_test_pad = np.array(X_test_pad)
A_test_pad = np.array(A_test_pad)

# Predict coordinates
predictions = gcn_model.predict([X_test_pad, A_test_pad])

# Build submission
submission_rows = []
for idx, row in test_sequences.iterrows():
    target_id = row['target_id']
    pred_coords = predictions[idx]
    seq_length = len(row['encoded'])
    pred_coords = pred_coords[:seq_length, :]
    
    for i in range(seq_length):
        coords = pred_coords[i, :]
        submission_rows.append({
            'ID': f"{target_id}_{i+1}",
            'resname': row['sequence'][i],
            'resid': i+1,
            **{f"x_{j+1}": coords[0] for j in range(5)},
            **{f"y_{j+1}": coords[1] for j in range(5)},
            **{f"z_{j+1}": coords[2] for j in range(5)}
        })

submission_df = pd.DataFrame(submission_rows)
print("Submission DataFrame shape:", submission_df.shape)
print(submission_df.head(10))

## 6. Saving the Submission File

Finally, we save the submission file as `submission.csv`.

In [None]:
submission_df.to_csv("submission.csv", index=False)
print("Submission file saved as submission.csv")
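Before submitting, it can be useful to compare the generated table with `sample_submission.csv`, which is loaded at the start of the notebook. The check below is a sketch that assumes the sample file defines the expected columns and row order. `check_submission` and the toy frames are hypothetical.

```python
import pandas as pd

def check_submission(submission, sample):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if list(submission.columns) != list(sample.columns):
        problems.append("column names or order differ")
    if len(submission) != len(sample):
        problems.append(f"row count {len(submission)} != {len(sample)}")
    elif "ID" in submission and "ID" in sample:
        if not (submission["ID"].to_numpy() == sample["ID"].to_numpy()).all():
            problems.append("ID values or order differ")
    # Coordinate columns are named x_1..x_5, y_1..y_5, z_1..z_5 in the notebook
    coord_cols = [c for c in submission.columns if c[:2] in ("x_", "y_", "z_")]
    if submission[coord_cols].isna().any().any():
        problems.append("NaN in coordinate columns")
    return problems

# Toy frames, made up for illustration only
toy_sample = pd.DataFrame({"ID": ["T_1", "T_2"], "resname": ["G", "A"],
                           "resid": [1, 2], "x_1": [0.0, 0.0]})
toy_good = toy_sample.copy()
toy_good["x_1"] = [1.0, 2.0]
toy_reversed = toy_good.iloc[::-1].reset_index(drop=True)

print(check_submission(toy_good, toy_sample))
print(check_submission(toy_reversed, toy_sample))
```

With the notebook's variables this would be `check_submission(submission_df, sample_submission)`.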