---
## **<p style="text-align: center; text-decoration: underline;">DATA CHALLENGE</p>**
# **<p style="text-align: center;">HUMAN MOTION DESCRIPTION (HMD): Motion-To-Text</p>**
---

> *2025*.

---

![examples](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fimg.clipart-library.com%2F2%2Fclip-motions%2Fclip-motions-6.png&f=1&nofb=1&ipt=0747ffa645bb5f7798e8a2d44499b28f1156ce0e83b1b300fabfed4c6ab1fdf2&ipo=images)

### ■ **Overview**
In this data challenge, you will explore the intersection of natural language processing (NLP) and human motion synthesis by working on text-to-motion and motion-to-text tasks using the HumanML3D dataset. This dataset contains 3D human motion sequences paired with rich textual descriptions, enabling models to learn bidirectional mappings between language and motion.

#### **I. Main Task: Motion-To-Text & Text-to-Motion Generation**
- **Motion-to-Text:** Develop a model to describe human motions in natural language given a sequence of 3D poses.

#### **II. Dataset Overview:**
- HumanML3D includes 14,616 motion samples across diverse actions (walking, dancing, sports) and 44,970 text annotations.
- Data includes skeletal joint positions, rotations, and fine-grained textual descriptions.

<img src="https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fproduction-media.paperswithcode.com%2Fdatasets%2F446194c5-ce59-43eb-b4cb-570a7a4d0cd9.png&f=1&nofb=1&ipt=b2edbe3251cab88e26a7f9d4e765c811b2cc890dc2ace7f7456baeca076b115b&ipo=images" alt="description" style="width:800px; height:600px;" />

The provided dataset contains the following components:

- 1. `motions` Folder: Contains `.npy` files, each representing a sequence of body poses. Each file has a shape of `(T, N, d)`, where:
  - `T`: Number of frames in the sequence (varies across sequences).
  - `N`: Number of joints in the body (22 in this case).
  - `d`: Dimension of each joint (3D coordinates: `x`, `y`, `z`).

- 2. `texts` Folder: Contains `.npy` files, each providing **3/4 textual descriptions** of the corresponding motion sequence. Each description is accompanied by part-of-speech (POS) tags for every word in the description. Example: "a person jump hop to the right#a/DET person/NOUN jump/NOUN hop/NOUN to/ADP the/DET right/NOUN#"

- 3. File Lists
    - **`train.txt`**: List of motion files for training.
    - **`val.txt`**: List of motion files for validation.
    - **`test.txt`**: List of motion files for testing.

#### **III. Evaluation Metrics**

**Similarity Score:** computes the similarity score between the predicted text and ground truth texts.
> Note: Higher similarity (closer to 1 or 100\%) indicate better text-motion alignment.

Solutions should be submitted in the following format (in a csv file):

For each ID in the motion test set (`test.txt`), you must predict the corresponding description. The file should contain a header and have the following format:

| id      | text                                                                 |
|---------|---------------------------------------------------------------------|
| 004822  | A person walks slowly forward, swinging their arms naturally        |
| 014457  | Someone performs a golf swing with proper form                      |
| 009613  | An individual jogs backwards diagonally across the room             |
| 008463  | A man bends down to pick up an object while walking                 |
| 012365  | A dancer spins clockwise while raising both arms                    |
| 007933  | Two people engage in a slow-motion martial arts demonstration       |
| 003430  | A child skips happily across a playground                           |
| 014522  | An athlete performs a perfect cartwheel sequence                    |
| 005698  | A woman gracefully practices yoga sun salutations                   |
| 001664  | A parkour expert vaults over a low wall                             |

You can generate your submission files using pandas as follows:

    >>> submission = pd.DataFrame({
    ...     'id': ['004822', '014457', ...],
    ...     'text': [
    ...         "a person walking slowly",
    ...         "someone swinging a golf club",
    ...         ...
    ...     ]
    ... })
    ... submission.to_csv('./submission.csv', index=False)
    
#### **References**

- Jiang, B., Chen, X., Liu, W., Yu, J., Yu, G., & Chen, T. (2023). Motiongpt: Human motion as a foreign language. Advances in Neural Information Processing Systems, 36, 20067-20079.
- Zhu, W., Ma, X., Ro, D., Ci, H., Zhang, J., Shi, J., ... & Wang, Y. (2023). Human motion generation: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Xu, L., Song, Z., Wang, D., Su, J., Fang, Z., Ding, C., ... & Wu, W. (2023). Actformer: A gan-based transformer towards general action-conditioned 3d human motion generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 2228-2238).

### **Animation Demo**

In [2]:
import os
from os.path import join as pjoin
from tqdm import tqdm
import numpy as np

import matplotlib
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.animation import FuncAnimation, PillowWriter
from mpl_toolkits.mplot3d.art3d import Poly3DCollection
import mpl_toolkits.mplot3d.axes3d as p3

# Define the kinematic tree for connecting joints
kinematic_tree = [
    [0, 2, 5, 8, 11], 
    [0, 1, 4, 7, 10], 
    [0, 3, 6, 9, 12, 15], 
    [9, 14, 17, 19, 21], 
    [9, 13, 16, 18, 20]
]

def plot_3d_motion(save_path, joints, title, figsize=(10, 10), fps=120, radius=4):
    # Split the title if it's too long
    title_sp = title.split(' ')
    if len(title_sp) > 10:
        title = '\n'.join([' '.join(title_sp[:10]), ' '.join(title_sp[10:])])

    def init():
        ax.set_xlim3d([-radius / 2, radius / 2])
        ax.set_ylim3d([0, radius])
        ax.set_zlim3d([0, radius])
        fig.suptitle(title, fontsize=20)
        ax.grid(b=False)

    def plot_xzPlane(minx, maxx, miny, minz, maxz):
        # Plot a plane XZ
        verts = [
            [minx, miny, minz],
            [minx, miny, maxz],
            [maxx, miny, maxz],
            [maxx, miny, minz]
        ]
        xz_plane = Poly3DCollection([verts])
        xz_plane.set_facecolor((0.5, 0.5, 0.5, 0.5))
        ax.add_collection3d(xz_plane)

    # Reshape the joints data
    data = joints.copy().reshape(len(joints), -1, 3)
    # fig = plt.figure(figsize=figsize)
    # ax = p3.Axes3D(fig)
    fig = plt.figure(figsize=figsize)
    ax = fig.add_subplot(111, projection='3d')
    init()

    # Compute min and max values for the data
    MINS = data.min(axis=0).min(axis=0)
    MAXS = data.max(axis=0).max(axis=0)

    # Define colors for the kinematic tree
    colors = ['red', 'blue', 'black', 'red', 'blue',  
              'darkblue', 'darkblue', 'darkblue', 'darkblue', 'darkblue',
              'darkred', 'darkred', 'darkred', 'darkred', 'darkred']

    frame_number = data.shape[0]

    # Adjust the height offset
    height_offset = MINS[1]
    data[:, :, 1] -= height_offset
    trajec = data[:, 0, [0, 2]]

    # Center the data
    data[..., 0] -= data[:, 0:1, 0]
    data[..., 2] -= data[:, 0:1, 2]

    def update(index):
        # Clear existing lines and collections
        for line in ax.lines:
            line.remove()
        for collection in ax.collections:
            collection.remove()

        # Update the view
        ax.view_init(elev=120, azim=-90)
        ax.dist = 7.5

        # Plot the XZ plane
        plot_xzPlane(MINS[0] - trajec[index, 0], MAXS[0] - trajec[index, 0], 0, MINS[2] - trajec[index, 1], MAXS[2] - trajec[index, 1])

        # Plot the trajectory
        if index > 1:
            ax.plot3D(trajec[:index, 0] - trajec[index, 0], np.zeros_like(trajec[:index, 0]), trajec[:index, 1] - trajec[index, 1], linewidth=1.0, color='blue')

        # Plot the kinematic tree
        for i, (chain, color) in enumerate(zip(kinematic_tree, colors)):
            linewidth = 4.0 if i < 5 else 2.0
            ax.plot3D(data[index, chain, 0], data[index, chain, 1], data[index, chain, 2], linewidth=linewidth, color=color)
        # Hide axis labels
        plt.axis('off')
        ax.set_xticklabels([])
        ax.set_yticklabels([])
        ax.set_zticklabels([])

    # Create the animation
    ani = FuncAnimation(fig, update, frames=frame_number, interval=1000 / fps, repeat=False)

    # Save the animation
    ani.save(save_path, fps=fps)
    plt.close()

    print(f'Animation saved to {save_path}!')

In [3]:
## /!\ attention ! travaux: path to data -> replace this with your own paths
motion_data_dir = '/kaggle/input/human-motion-description-hmd-motion-to-text/motions/'
text_data_dir = '/kaggle/input/human-motion-description-hmd-motion-to-text/texts/' 

## list all files in the folder
npy_files = sorted(os.listdir(motion_data_dir))

## pick a random motion file
npy_file = np.random.choice(npy_files)

## read npy motion file
motion_data = np.load(os.path.join(motion_data_dir, npy_file))
print('shape', motion_data.shape)

## get the corresponding titles for the given motion
titles = []
with open('{}{}.txt'.format(text_data_dir, npy_file.split('.')[0])) as f:
    descriptions = f.readlines()
    for desc in descriptions:
        titles.append(desc.split('#')[0].capitalize())

print('Descriptions:')
print('- '+'\n- '.join(titles))

## pick a random title
title = np.random.choice(titles)

## create & save animation
save_path = './animation.gif'
plot_3d_motion(save_path, motion_data, title=title, figsize=(10, 6), fps=30, radius=4)

shape (100, 22, 3)
Descriptions:
- A man walks up steps with his left hand on the railing.
- A person slowly walked upstairs
- While holding on to a rail with his left hand a person climbs up stairs.


  ax.dist = 7.5


Animation saved to ./animation.gif!


In [None]:
import os
import torch
from PIL import Image
from torchvision import transforms

def extract_and_stack_frames(gif_path):
    """
    Extract frames from a GIF, preprocess them, and return a stacked tensor.
    
    Args:
    - gif_path (str): Path to the input GIF file.
    
    Returns:
    - frames_tensor (Tensor): Tensor of shape (T, 3, 224, 224), where:
        - T = number of frames
        - 3 = RGB channels
        - 224x224 = Resized image for ClipBERT
    """
    gif = Image.open(gif_path)
    frames = []

    # Define the transformation (Resize + Normalize)
    transform = transforms.Compose([
        transforms.Resize((224, 224)),  # Resize for ClipBERT
        transforms.ToTensor(),  # Convert to tensor
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])  # Normalize
    ])

    while True:
        # Convert frame to RGB and apply transformation
        frame = gif.convert("RGB")
        frame_tensor = transform(frame)
        frames.append(frame_tensor)
        
        try:
            gif.seek(gif.tell() + 1)  # Move to the next frame
        except EOFError:
            break  # No more frames

    # Stack frames along the batch dimension
    frames_tensor = torch.stack(frames)  # Shape: (T, 3, 224, 224)

    return frames_tensor
    
# Example usage
gif_path = "animation.gif"
frames  = extract_and_stack_frames(gif_path)


In [4]:
from dataset_dataloader import *
data_dir = '/kaggle/input/human-motion-description-hmd-motion-to-text/'
train_set = MotionDataset(data_dir, 'train.txt', mean=None, std=None)
valid_set = MotionDataset(data_dir, 'val.txt', mean=None, std=None)

batch_size = 64
train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(valid_set, batch_size=batch_size)


for motion, text in train_loader:
    print('motion shape:', motion.shape)
    print('exemple of texts:', text[0])
    break


loading data...: 100%|██████████| 13012/13012 [02:32<00:00, 85.20it/s]
loading data...: 100%|██████████| 3254/3254 [00:37<00:00, 86.82it/s]

motion shape: torch.Size([64, 100, 22, 3])
exemple of texts: person scratches their face with left hand then right hand





In [6]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

# Define the kinematic tree
kinematic_tree = [
    [0, 2, 5, 8, 11], 
    [0, 1, 4, 7, 10], 
    [0, 3, 6, 9, 12, 15], 
    [9, 14, 17, 19, 21], 
    [9, 13, 16, 18, 20]
]

# Number of joints
num_joints = 22

# Build adjacency matrix
adj_matrix = np.zeros((num_joints, num_joints), dtype=np.float32)
for branch in kinematic_tree:
    for i in range(len(branch) - 1):
        adj_matrix[branch[i], branch[i + 1]] = 1
        adj_matrix[branch[i + 1], branch[i]] = 1  # Undirected graph

# Convert to PyTorch tensor
adj_matrix = torch.tensor(adj_matrix)

## **GCN basic**

In [25]:
class GCNEncoder(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, adj_matrix):
        super(GCNEncoder, self).__init__()
        self.adj_matrix = adj_matrix
        self.gcn1 = nn.Linear(input_dim, hidden_dim)
        self.gcn2 = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, x):
        # x shape: (batch_size, T, N, d)
        batch_size, T, N, d = x.shape
        
        # Reshape for GCN: (batch_size * T, N, d)
        x = x.view(-1, N, d)
        
        # GCN Layer 1
        x = F.relu(torch.matmul(self.adj_matrix, self.gcn1(x)))
        
        # GCN Layer 2
        x = torch.matmul(self.adj_matrix, self.gcn2(x))
        
        # Reshape back: (batch_size, T, N, output_dim)
        x = x.view(batch_size, T, N, -1)
        
        # Aggregate over joints and time: (batch_size, output_dim)
        x = x.mean(dim=1).mean(dim=1)
        
        return x

## **GCN improve**

In [None]:
class GCNEncoder(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, adj_matrix):
        super(GCNEncoder, self).__init__()
        self.adj_matrix = adj_matrix
        self.gcn1 = nn.Linear(input_dim, hidden_dim)
        self.gcn2 = nn.Linear(hidden_dim, hidden_dim)
        self.gcn3 = nn.Linear(hidden_dim, output_dim)
        self.bn1 = nn.BatchNorm1d(hidden_dim)  # BatchNorm for GCN Layer 1
        self.bn2 = nn.BatchNorm1d(hidden_dim)  # BatchNorm for GCN Layer 2
        self.bn3 = nn.BatchNorm1d(output_dim)  # BatchNorm for GCN Layer 3
        
    def forward(self, x):
        # x shape: (batch_size, T, N, d)
        batch_size, T, N, d = x.shape
        
        # Reshape for GCN: (batch_size * T, N, d)
        x = x.view(-1, N, d)
        
        # GCN Layer 1
        x = torch.matmul(self.adj_matrix, self.gcn1(x))  # Shape: (batch_size * T, N, hidden_dim)
        x = x.view(-1, self.gcn1.out_features)  # Reshape for BatchNorm: (batch_size * T * N, hidden_dim)
        x = self.bn1(x)  # Apply BatchNorm
        x = x.view(-1, N, self.gcn1.out_features)  # Reshape back: (batch_size * T, N, hidden_dim)
        x = F.relu(x)
        
        # GCN Layer 2
        x = torch.matmul(self.adj_matrix, self.gcn2(x))  # Shape: (batch_size * T, N, hidden_dim)
        x = x.view(-1, self.gcn2.out_features)  # Reshape for BatchNorm: (batch_size * T * N, hidden_dim)
        x = self.bn2(x)  # Apply BatchNorm
        x = x.view(-1, N, self.gcn2.out_features)  # Reshape back: (batch_size * T, N, hidden_dim)
        x = F.relu(x)
        
        # GCN Layer 3
        x = torch.matmul(self.adj_matrix, self.gcn3(x))  # Shape: (batch_size * T, N, output_dim)
        x = x.view(-1, self.gcn3.out_features)  # Reshape for BatchNorm: (batch_size * T * N, output_dim)
        x = self.bn3(x)  # Apply BatchNorm
        x = x.view(-1, N, self.gcn3.out_features)  # Reshape back: (batch_size * T, N, output_dim)
        
        # Reshape back: (batch_size, T, N, output_dim)
        x = x.view(batch_size, T, N, -1)
        
        # Aggregate over joints and time: (batch_size, output_dim)
        x = x.mean(dim=1).mean(dim=1)
        
        return x

## **GCN with dropout**

In [None]:
class GCNEncoder(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, adj_matrix):
        super(GCNEncoder, self).__init__()
        self.adj_matrix = adj_matrix
        self.gcn1 = nn.Linear(input_dim, hidden_dim)
        self.gcn2 = nn.Linear(hidden_dim, hidden_dim)
        self.gcn3 = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(0.5)  # Add dropout
        
    def forward(self, x):
        # x shape: (batch_size, T, N, d)
        batch_size, T, N, d = x.shape
        
        # Reshape for GCN: (batch_size * T, N, d)
        x = x.view(-1, N, d)
        
        # GCN Layer 1
        x = F.relu(torch.matmul(self.adj_matrix, self.gcn1(x)))
        x = self.dropout(x)  # Apply dropout
        
        # GCN Layer 2
        x = F.relu(torch.matmul(self.adj_matrix, self.gcn2(x)))
        x = self.dropout(x)  # Apply dropout
        
        # GCN Layer 3
        x = torch.matmul(self.adj_matrix, self.gcn3(x))
        
        # Reshape back: (batch_size, T, N, output_dim)
        x = x.view(batch_size, T, N, -1)
        
        # Aggregate over joints and time: (batch_size, output_dim)
        x = x.mean(dim=1).mean(dim=1)
        
        return x

## STGCN

In [10]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class STGCNLayer(nn.Module):
    def __init__(self, in_channels, out_channels, edge_index):
        super(STGCNLayer, self).__init__()
        self.gcn = GCNConv(in_channels, out_channels)
        self.temporal_conv = nn.Conv1d(out_channels, out_channels, kernel_size=3, padding=1)
        self.edge_index = edge_index
        self.bn = nn.BatchNorm1d(out_channels)

    def forward(self, x):
        # Input shape: (batch_size, T, N, in_channels)
        batch_size, T, N, in_channels = x.shape

        # Spatial GCN ----------------------------------------------------------
        x = x.reshape(batch_size * T * N, in_channels)  # (batch*T*N, in_channels)
        x = self.gcn(x, self.edge_index)                # (batch*T*N, out_channels)
        x = x.reshape(batch_size, T, N, -1)             # (batch, T, N, out_channels)

        # Temporal Convolution --------------------------------------------------
        x = x.permute(0, 3, 1, 2)  # (batch, out_channels, T, N)
        x = x.reshape(batch_size, self.gcn.out_channels, T * N)
        x = self.temporal_conv(x)  # (batch, out_channels, T*N)
        x = x.reshape(batch_size, self.gcn.out_channels, T, N)
        x = x.permute(0, 2, 3, 1)  # (batch, T, N, out_channels) <-- Critical fix

        # Batch Normalization --------------------------------------------------
        x = x.reshape(-1, self.gcn.out_channels)  # (batch*T*N, out_channels)
        x = self.bn(x)
        x = x.reshape(batch_size, T, N, -1)       # (batch, T, N, out_channels)
        return F.relu(x)


In [11]:
# Example input
batch_size, T, N, d = 16, 10, 22, 3  # Input dimensions
x = torch.randn(batch_size, T, N, d)  # Random input

# Define edge_index (from previous code)
edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]], dtype=torch.long)  # Example edge_index

# Initialize STGCNLayer
st_gcn_layer = STGCNLayer(in_channels=d, out_channels=64, edge_index=edge_index)

# Forward pass
output = st_gcn_layer(x)
print(output.shape)  # Should be (batch_size, out_channels, T, N)

torch.Size([16, 10, 22, 64])


In [12]:
import torch.nn as nn
import torch.nn.functional as F

class STGCNEncoder(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, edge_index):
        super(STGCNEncoder, self).__init__()
        self.output_dim = output_dim
        self.edge_index = edge_index
        
        # ST-GCN layers
        self.st_gcn1 = STGCNLayer(input_dim, hidden_dim, edge_index)
        self.st_gcn2 = STGCNLayer(hidden_dim, hidden_dim, edge_index)
        self.st_gcn3 = STGCNLayer(hidden_dim, output_dim, edge_index)
        
    def forward(self, x):
        # Input shape: (batch_size, T, N, input_dim)
        x = self.st_gcn1(x)  # Output: (batch, T, N, hidden_dim)
        residual = x
        x = self.st_gcn2(x)  # Output: (batch, T, N, hidden_dim)
        x = x + residual     # Residual connection
        x = self.st_gcn3(x)  # Output: (batch, T, N, output_dim)
        
        # Aggregate over time (T) and joints (N)
        x = x.mean(dim=1).mean(dim=1)  # Shape: (batch_size, output_dim)
        return x  # Fixed: Added layer
        
    """def forward(self, x):
        # x shape: (batch_size, T, N, d)
        print("Input shape:", x.shape)  # Debug
        x = self.st_gcn1(x)
        print("After st_gcn1:", x.shape)  # Should be (16, 64, 10, 22)
        residual = x
        x = self.st_gcn2(x)
        print("After st_gcn2:", x.shape)  # Should be (16, 64, 10, 22)
        x = x + residual  # Residual connection
        print("After residual:", x.shape)  # Should be (16, 64, 10, 22)
        x = self.st_gcn3(x)
        print("After st_gcn3:", x.shape)  # Should be (16, 128, 10, 22)
        x = x.mean(dim=2).mean(dim=2)  # Aggregate over T and N
        print("Final output:", x.shape)  # Should be (16, 128)
        return x"""

In [13]:
# Test input
batch_size, T, N, d = 16, 10, 22, 3
x = torch.randn(batch_size, T, N, d)

# Define edge_index (replace with your kinematic tree)
edges = []
kinematic_tree = [[0, 2, 5, 8, 11], [0, 1, 4, 7, 10], [0, 3, 6, 9, 12, 15], [9, 14, 17, 19, 21], [9, 13, 16, 18, 20]]
for branch in kinematic_tree:
    for i in range(len(branch) - 1):
        edges.append([branch[i], branch[i+1]])
        edges.append([branch[i+1], branch[i]])  # Undirected edges
edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()

# Initialize encoder
encoder = STGCNEncoder(input_dim=3, hidden_dim=64, output_dim=128, edge_index=edge_index)

# Forward pass
output = encoder(x)
print("Encoder output shape:", output.shape)  # Should be (16, 128)

Encoder output shape: torch.Size([16, 128])


In [14]:
import numpy as np

# Define kinematic_tree
kinematic_tree = [
    [0, 2, 5, 8, 11], 
    [0, 1, 4, 7, 10], 
    [0, 3, 6, 9, 12, 15], 
    [9, 14, 17, 19, 21], 
    [9, 13, 16, 18, 20]
]

# Convert to edge indices
edges = []
for branch in kinematic_tree:
    for i in range(len(branch) - 1):
        edges.append([branch[i], branch[i + 1]])
        edges.append([branch[i + 1], branch[i]])  # Undirected graph

edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()

## **Motion to text Model for simple GCN**

In [26]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

class MotionToTextModel(nn.Module):
    def __init__(self, gcn_encoder, t5_model_name='t5-small'):
        super(MotionToTextModel, self).__init__()
        self.gcn_encoder = gcn_encoder
        self.t5 = T5ForConditionalGeneration.from_pretrained(t5_model_name)
        self.tokenizer = T5Tokenizer.from_pretrained(t5_model_name)
        
        # Project motion encoding to T5 embedding dimension
        t5_embedding_dim = self.t5.config.d_model
        self.projection = nn.Linear(gcn_encoder.gcn2.out_features, t5_embedding_dim)
        
    def forward(self, motion, target_text=None):
        # Encode motion: (batch_size, output_dim)
        motion_encoded = self.gcn_encoder(motion)
        
        # Project motion encoding to T5 embedding dimension: (batch_size, t5_embedding_dim)
        motion_encoded = self.projection(motion_encoded)
        
        # Prepare input for T5
        input_ids = self.tokenizer(
            "motion to text: ", return_tensors="pt"
        ).input_ids.to(motion.device)
        
        # Repeat input_ids for the batch size
        input_ids = input_ids.repeat(motion.size(0), 1)
        
        # Get T5 input embeddings: (batch_size, sequence_length, t5_embedding_dim)
        decoder_inputs_embeds = self.t5.get_input_embeddings()(input_ids)
        
        # Concatenate motion encoding with T5 input embeddings
        motion_encoded = motion_encoded.unsqueeze(1)  # (batch_size, 1, t5_embedding_dim)
        decoder_inputs_embeds = torch.cat([motion_encoded, decoder_inputs_embeds], dim=1)
        
        # Generate text
        if target_text is not None:
            # Training mode
            labels = self.tokenizer(
                target_text, return_tensors="pt", padding=True, truncation=True
            ).input_ids.to(motion.device)
            outputs = self.t5(inputs_embeds=decoder_inputs_embeds, labels=labels)
            return outputs.loss
        else:
            # Inference mode
            generated_ids = self.t5.generate(inputs_embeds=decoder_inputs_embeds)
            generated_text = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
            return generated_text

## **Motion to text model for multilayer GCN**

In [15]:
class MotionToTextModel(nn.Module):
    def __init__(self, gcn_encoder, t5_model_name='t5-small'):
        super(MotionToTextModel, self).__init__()
        self.gcn_encoder = gcn_encoder
        self.t5 = T5ForConditionalGeneration.from_pretrained(t5_model_name)
        self.tokenizer = T5Tokenizer.from_pretrained(t5_model_name)
        
        # Project motion encoding to T5 embedding dimension
        t5_embedding_dim = self.t5.config.d_model
        self.projection = nn.Linear(gcn_encoder.gcn3.out_features, t5_embedding_dim)  # Use gcn3.out_features
        
    def forward(self, motion, target_text=None):
        # Encode motion: (batch_size, output_dim)
        motion_encoded = self.gcn_encoder(motion)
        
        # Project motion encoding to T5 embedding dimension: (batch_size, t5_embedding_dim)
        motion_encoded = self.projection(motion_encoded)
        
        # Prepare input for T5
        input_ids = self.tokenizer(
            "motion to text: ", return_tensors="pt"
        ).input_ids.to(motion.device)
        
        # Repeat input_ids for the batch size
        input_ids = input_ids.repeat(motion.size(0), 1)
        
        # Get T5 input embeddings: (batch_size, sequence_length, t5_embedding_dim)
        decoder_inputs_embeds = self.t5.get_input_embeddings()(input_ids)
        
        # Concatenate motion encoding with T5 input embeddings
        motion_encoded = motion_encoded.unsqueeze(1)  # (batch_size, 1, t5_embedding_dim)
        decoder_inputs_embeds = torch.cat([motion_encoded, decoder_inputs_embeds], dim=1)
        
        # Generate text
        if target_text is not None:
            # Training mode
            labels = self.tokenizer(
                target_text, return_tensors="pt", padding=True, truncation=True
            ).input_ids.to(motion.device)
            outputs = self.t5(inputs_embeds=decoder_inputs_embeds, labels=labels)
            return outputs.loss
        else:
            # Inference mode
            generated_ids = self.t5.generate(inputs_embeds=decoder_inputs_embeds)
            generated_text = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
            return generated_text

## Motion To Text for STGCN

In [16]:
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration, T5Tokenizer

class MotionToTextModel(nn.Module):
    def __init__(self, gcn_encoder, t5_model_name='t5-small'):
        super(MotionToTextModel, self).__init__()
        self.gcn_encoder = gcn_encoder
        self.t5 = T5ForConditionalGeneration.from_pretrained(t5_model_name)
        self.tokenizer = T5Tokenizer.from_pretrained(t5_model_name)
        
        # Project motion encoding to T5 embedding dimension
        t5_embedding_dim = self.t5.config.d_model
        self.projection = nn.Linear(gcn_encoder.output_dim, t5_embedding_dim)
        
    def forward(self, motion, target_text=None):
        # Encode motion: (batch_size, output_dim)
        motion_encoded = self.gcn_encoder(motion)
        
        # Project motion encoding to T5 embedding dimension: (batch_size, t5_embedding_dim)
        motion_encoded = self.projection(motion_encoded)
        
        # Prepare T5 input
        input_text = "motion to text: "
        input_ids = self.tokenizer(
            input_text, return_tensors="pt", padding=True, truncation=True
        ).input_ids.to(motion.device)
        
        # Repeat input_ids for the batch size
        input_ids = input_ids.repeat(motion.size(0), 1)
        
        # Get T5 input embeddings: (batch_size, sequence_length, t5_embedding_dim)
        inputs_embeds = self.t5.get_input_embeddings()(input_ids)
        
        # Concatenate motion encoding with T5 input embeddings
        motion_encoded = motion_encoded.unsqueeze(1)  # (batch_size, 1, t5_embedding_dim)
        inputs_embeds = torch.cat([motion_encoded, inputs_embeds], dim=1)
        
        if target_text is not None:
            # Training mode
            labels = self.tokenizer(
                target_text, return_tensors="pt", padding=True, truncation=True
            ).input_ids.to(motion.device)
            
            # Forward pass with labels
            outputs = self.t5(inputs_embeds=inputs_embeds, labels=labels)
            return outputs.loss
        else:
            # Inference mode
            generated_ids = self.t5.generate(inputs_embeds=inputs_embeds)
            generated_text = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
            return generated_text

In [17]:
def augment_motion(motion, noise_scale=0.01):
    noise = torch.randn_like(motion) * noise_scale
    return motion + noise

In [28]:
import numpy as np
# Hyperparameters
input_dim = 3  # x, y, z coordinates
hidden_dim = 64
output_dim = 128
batch_size = 64
learning_rate = 1e-4
num_epochs = 35

# Initialize model and move to device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move adj_matrix to the device
adj_matrix = adj_matrix.to(device)

# Initialize GCN encoder and model
gcn_encoder = GCNEncoder(input_dim, hidden_dim, output_dim, adj_matrix)
model = MotionToTextModel(gcn_encoder).to(device)

# Optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
for epoch in range(num_epochs):
    model.train()
    for motion, text in train_loader:
        if np.random.random() < 0.5:
            motion = augment_motion(motion, noise_scale=0.01)
        # Move data to device
        motion = motion.to(device)
        
        # Forward pass
        loss = model(motion, text)
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
    print(f"Epoch {epoch + 1}, Loss: {loss.item()}")

Epoch 1, Loss: 1.4037609100341797
Epoch 2, Loss: 1.309475064277649
Epoch 3, Loss: 1.7307119369506836
Epoch 4, Loss: 0.7791255712509155
Epoch 5, Loss: 0.977437436580658
Epoch 6, Loss: 1.1363563537597656
Epoch 7, Loss: 1.036194086074829
Epoch 8, Loss: 1.1051623821258545
Epoch 9, Loss: 1.4576685428619385
Epoch 10, Loss: 1.4136892557144165
Epoch 11, Loss: 0.7947259545326233
Epoch 12, Loss: 1.2465604543685913
Epoch 13, Loss: 1.3981118202209473
Epoch 14, Loss: 1.1476515531539917
Epoch 15, Loss: 1.2456958293914795
Epoch 16, Loss: 0.9917926788330078
Epoch 17, Loss: 1.1449682712554932
Epoch 18, Loss: 1.16427743434906
Epoch 19, Loss: 1.1025314331054688
Epoch 20, Loss: 0.9347779154777527
Epoch 21, Loss: 1.597883939743042
Epoch 22, Loss: 1.0885844230651855
Epoch 23, Loss: 1.3894978761672974
Epoch 24, Loss: 1.030968427658081
Epoch 25, Loss: 1.1392749547958374
Epoch 26, Loss: 1.258576512336731
Epoch 27, Loss: 0.9019225835800171
Epoch 28, Loss: 1.009253740310669
Epoch 29, Loss: 0.8514111042022705
Epo

## Training for STGCN

In [20]:
from torch.optim.lr_scheduler import StepLR


# Hyperparameters
input_dim = 3  # x, y, z coordinates
hidden_dim = 64
output_dim = 128
batch_size = 64
learning_rate = 1e-4
num_epochs = 45

# Initialize model and move to device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move edge_index to the device
edge_index = edge_index.to(device)

# Initialize ST-GCN encoder and model
st_gcn_encoder = STGCNEncoder(input_dim, hidden_dim, output_dim, edge_index)
model = MotionToTextModel(st_gcn_encoder).to(device)

# Initialize optimizer and scheduler
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=1e-5)  # Add weight decay
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)  # Reduce LR by 0.1 every 10 epochs

# Training loop
for epoch in range(num_epochs):
    model.train()
    for motion, text in train_loader:
        if np.random.random() < 0.5:
            motion = augment_motion(motion, noise_scale=0.01)
        # Move data to device
        motion = motion.to(device)
        
        # Forward pass
        loss = model(motion, text)
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
    print(f"Epoch {epoch + 1}, Loss: {loss.item()}")

Epoch 1, Loss: 0.9364256858825684
Epoch 2, Loss: 1.1304962635040283
Epoch 3, Loss: 1.2142441272735596
Epoch 4, Loss: 1.556441307067871
Epoch 5, Loss: 1.1923068761825562
Epoch 6, Loss: 0.9270095825195312
Epoch 7, Loss: 1.2069671154022217
Epoch 8, Loss: 0.9979519844055176
Epoch 9, Loss: 1.3539304733276367
Epoch 10, Loss: 1.0229213237762451
Epoch 11, Loss: 1.2794729471206665
Epoch 12, Loss: 1.0508111715316772
Epoch 13, Loss: 1.190352201461792
Epoch 14, Loss: 0.9436461925506592
Epoch 15, Loss: 1.4238595962524414
Epoch 16, Loss: 1.102981686592102
Epoch 17, Loss: 1.214152455329895
Epoch 18, Loss: 0.8955529928207397
Epoch 19, Loss: 1.2250924110412598
Epoch 20, Loss: 1.0722805261611938
Epoch 21, Loss: 0.7945737242698669
Epoch 22, Loss: 0.776954174041748
Epoch 23, Loss: 1.018916368484497
Epoch 24, Loss: 0.7210644483566284
Epoch 25, Loss: 1.3190574645996094
Epoch 26, Loss: 0.9304620623588562
Epoch 27, Loss: 1.0007092952728271
Epoch 28, Loss: 1.318511724472046
Epoch 29, Loss: 1.180087685585022
Ep

## Training with optimized learning_rate

In [None]:
from torch.optim.lr_scheduler import StepLR

# Hyperparameters
input_dim = 3  # x, y, z coordinates
hidden_dim = 64
output_dim = 128
batch_size = 64
learning_rate = 1e-4
num_epochs = 20

# Initialize model and move to device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move adj_matrix to the device
adj_matrix = adj_matrix.to(device)

# Initialize GCN encoder and model
gcn_encoder = GCNEncoder(input_dim, hidden_dim, output_dim, adj_matrix)
model = MotionToTextModel(gcn_encoder).to(device)

# Initialize optimizer and scheduler
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=1e-5)  # Add weight decay
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)  # Reduce LR by 0.1 every 10 epochs


# Training loop
for epoch in range(num_epochs):
    model.train()
    for motion, text in train_loader:
        # Move data to device
        motion = motion.to(device)
        
        # Forward pass
        loss = model(motion, text)
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
    print(f"Epoch {epoch + 1}, Loss: {loss.item()}")

In [29]:
model.eval()
with torch.no_grad():
    count = 0
    for motion, text in valid_loader:
        
        if count > 10 :
            break
        motion = motion.to(device)
        generated_text = model(motion)
        print("Generated Text:", generated_text[0])
        print("Ground Truth Text:", text[0])
        count += 1
        

Generated Text: a person is sitting down and then stands back up.
Ground Truth Text: the man has crossed his legs and sat down
Generated Text: a person walks in a circle, turns around and walks back.
Ground Truth Text: the person walks in a straight line at a angle to their left, then turns around and jogs back to the start.
Generated Text: a person walks backwards, then turns around and walks backwards.
Ground Truth Text: a person is spinning slowly in place, reaching hands out occasionally.
Generated Text: a person walks in a counterclockwise circle.
Ground Truth Text: person walks to the left, opens something and then walks back
Generated Text: a person walks backwards, then turns around and walks backwards.
Ground Truth Text: a person ice skating in a circle
Generated Text: a person walks forward, turns around and walks back.
Ground Truth Text: a man walks slowly forward.
Generated Text: a person sits down and stands back up.
Ground Truth Text: a person is swimming slowly.
Generate

In [22]:
import csv
motion_dir = "/kaggle/input/human-motion-description-hmd-motion-to-text/motions/"  # Folder containing .npy files
test_file = "/kaggle/input/human-motion-description-hmd-motion-to-text/test.txt"  # List of motion IDs
output_csv = "submission.csv"  # Output CSV
captions = [["id", "text"]]  # Store results
with open(test_file, "r") as f:
    motion_ids = f.read().splitlines()  # Read all motion IDs
for motion_id in tqdm(motion_ids, desc="Processing motions"):
    model.train()
    motion_path = os.path.join(motion_dir, motion_id + ".npy")
    
    if not os.path.exists(motion_path):
        print(f" Motion file {motion_id} not found. Skipping...")
        continue
    
    
    motion = np.load(os.path.join(motion_data_dir, motion_path))
    motion = torch.from_numpy(motion).unsqueeze(0)
    motion = motion.to(device)
    generated_text = model(motion)
    # Save result
    captions.append([motion_id, generated_text])

with open(output_csv, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(captions)

print(f"\n Motion captions saved to {output_csv}!")

Processing motions: 100%|██████████| 1000/1000 [02:34<00:00,  6.47it/s]


 Motion captions saved to submission.csv!





In [23]:
import pandas as pd
sub = pd.read_csv('/kaggle/working/submission.csv')
sub["text"] = sub["text"].str.strip("[]'")
sub.to_csv('submission_.csv')

In [24]:
sub

Unnamed: 0,id,text
0,014295,a person walks forward and turns around.
1,005166,"a person walks forward, turns around and walks..."
2,M002388,a person walks forward and then turns around a...
3,M007286,a person walks forward slowly.
4,000304,"a person walks forward, then turns around and ..."
...,...,...
995,M011372,a person walks forward and turns around and wa...
996,M007007,a person walks forward and then turns to the l...
997,M010671,a person walks forward and then turns around a...
998,001038,"a person walks forward, turns around, and walk..."


### **Evaluation Metric `Meteor-Score`**

In [None]:
"""
Text Generation Evaluation Metric (Meteor Score)

Evaluates submissions using meteor score to reference texts
"""
import bisect
from collections import defaultdict

def compute_lis(arr):
    """Compute longest increasing subsequence (O(n log n) time)."""
    tails = []
    for num in arr:
        idx = bisect.bisect_left(tails, num)
        if idx == len(tails):
            tails.append(num)
        else:
            tails[idx] = num
    return tails

def _calculate_chunks(reference_unigrams, candidate_unigrams):
    """Optimized chunk calculation using LIS."""
    # Create inverted index for reference words
    word_to_ref_indices = defaultdict(list)
    for idx, word in enumerate(reference_unigrams):
        word_to_ref_indices[word].append(idx)
    
    # Collect all matching positions
    matches = []
    for c_idx, word in enumerate(candidate_unigrams):
        if word in word_to_ref_indices:
            matches.extend((r_idx, c_idx) for r_idx in word_to_ref_indices[word])
    
    if not matches:
        return 0, 0
    
    # Sort matches by reference index then candidate index
    matches.sort(key=lambda x: (x[0], x[1]))
    cand_indices = [c_idx for _, c_idx in matches]
    
    # Get LIS of candidate indices
    lis = compute_lis(cand_indices)
    if not lis:
        return 0, 0
    
    # Calculate number of chunks
    chunks = 1
    for i in range(1, len(lis)):
        if lis[i] != lis[i-1] + 1:
            chunks += 1
    
    return chunks, len(lis)

def meteor(reference, candidate):
    """Optimized METEOR score calculation."""
    ref_tokens = reference.split()
    can_tokens = candidate.split()
    
    # Fast intersection check
    ref_words = set(ref_tokens)
    if not any(word in ref_words for word in can_tokens):
        return 0.0
    
    # Calculate precision/recall
    m = sum(1 for word in can_tokens if word in ref_words)
    precision = m / len(can_tokens) if can_tokens else 0
    recall = m / len(ref_tokens) if ref_tokens else 0
    
    # Handle edge cases
    if m == 0:
        return 0.0
    
    # Harmonic mean
    f_mean = (10 * precision * recall) / (recall + 9 * precision) if (precision + recall) > 0 else 0
    
    # Penalty calculation
    chunks, mappings = _calculate_chunks(ref_tokens, can_tokens)
    penalty = 0.5 * (chunks / mappings) ** 3
    
    return min(f_mean * (1 - penalty), 1.0)


def get_meteor_score(references, candidate):
    
    return max([meteor(reference, candidate) for reference in references])

In [None]:
## Usage Example
reference_texts = ["a person walks aimlessly and slowly in an imperfect circle around the room, lathargecly swaying their arms with each step.",
                   "a person walking with their arms swinging back to front and walking in a general circle.",
                   "a person walks in an oval path and ends where he started.",
                   ]
predicted_text = "a person walks in a circle path swinging with arms."

get_meteor_score(reference_texts, predicted_text)

### **Code to generate your submission `.csv` file**

In [None]:
import pandas as pd
import numpy as np
import random, string

## /!\ RED ALERT STORM
## This is just to generate random text to show you an example of submission
## In your case, you have to predict the texts using your trained model !
def generate_random_text(length):
    """generates random sentences"""
    letters = string.ascii_lowercase + '     '
    return ''.join(random.choice(letters) for i in range(length))

## /!\ alerte rouge ! vents -> replace this with your actual predictions
test_texts_ids = np.arange(0, 1000).astype('str') # list of test texts ids
pred_test_texts = [generate_random_text(30) for i in range(1000)] ## pred_test_text [numpy array or list] is your predicted texts #shape: (1000,), /!\ in the same order as the ids !

## create submission dataframe
submission_df = pd.DataFrame({'id': test_texts_ids,
                              'text': pred_test_texts})
submission_df.to_csv('./submission.csv', index=False)
submission_df

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Charger le fichier .npy
image = np.load("./motions")

# Afficher la forme de l'image
print(f"Shape de l'image : {image.shape}")

# Afficher l'image
plt.imshow(image, cmap="gray" if len(image.shape) == 2 else None)
plt.axis("off")  # Supprimer les axes
plt.show()
