# Data Preprocessing

In this section of the project, we will be working to preprocess and clean the data for analysis. We will be working with the following steps:

1) Visualize the data and identify any trends and patterns

2) Identify the target features and the predictor features

3) Identify any missing data and outliers

4) Identify how best to tackle the challenge

## Step 1. Load the Data

In [1]:
# Load the Test, Train, and Validation data from the numpy files
import numpy as np
import pandas as pd
import os 
import torch


# Load the 'xla' data
# Specify the path to be: npz/layout/xla/default/test
xla_default_layout_test = np.load('data/tpugraphs/npz/layout/xla/default/test/3e7156ac468dfb75cf5c9615e1e5887d.npz')

# List the names of the arrays in the file
array_names = xla_default_layout_test.files

# Loop through the arrays, and print their names and shapes 
for name in array_names:
    array = xla_default_layout_test[name]
    print(f" Array Name: {name}, Array Shape: {array.shape}")

 Array Name: node_feat, Array Shape: (41522, 140)
 Array Name: node_opcode, Array Shape: (41522,)
 Array Name: edge_index, Array Shape: (72902, 2)
 Array Name: node_config_feat, Array Shape: (1000, 2244, 18)
 Array Name: node_config_ids, Array Shape: (2244,)
 Array Name: config_runtime, Array Shape: (1000,)
 Array Name: node_splits, Array Shape: (1, 2)


# Step 2: Understanding the Data

The data provided came from a 'npz' file, which is a numpy array file holding numerous numpy arrays. This data format is different from traditional csv files, which is typically two-dimensional. numpy arrays allow for multi-dimensional arrays, which is useful for storing data in a more complex format.

When looking to understand the data, it is useful to visualize the number of dimensions within each specific array and the shape of each array. This will help us understand the data and how to best work with it.

For example, we can see that the 'node_feat' array is two-dimensional, with a shape of: (41522,140), meaning that there are 41522 rows and 140 columns. This means that there are 41522 nodes and 140 features for each node.

We can also see that the 'node_config_feat' array is three-dimensional, with a shape of: (1000,2244,18), meaning that there are 1000 rows, 2244 columns, and 18 layers. This means that there are 1000 nodes, 2244 nodes, and 18 features for each node.



In [2]:
# Start visualizing the data
node_feat_xla_test = xla_default_layout_test['node_feat']


In [3]:
## Create a function to understand the data

def print_array_info(data_npz):
    array_names = data_npz.files

    for name in array_names:
        array = data_npz[name]
        print(f" Array Name: {name}")
        print(f" Array Shape: {array.shape}")
        
        # Check the dimensionality of the array
        if array.ndim == 1:
            print(f" The {name} array is 1D")
        elif array.ndim == 2:
            print(f" The {name} array is 2D")
        elif array.ndim == 3:
            print(f" The {name} array is 3D")
        print("")


### -- Layout / XLA / Default / Test

In [4]:
# Load the file names within the test data folder
test_data_files = os.listdir('npz/layout/xla/default/test/')
test_data_df = pd.DataFrame(test_data_files, columns=['File Name'])

# Repeat the previous process, to understand all the arrays in the test data

for file in test_data_files:
    print(f"File Name: {file}")
    print("")
    xla_default_layout_test = np.load(f'data/tpugraphs/npz/layout/xla/default/test/{file}')
    print_array_info(xla_default_layout_test)
    print("")

FileNotFoundError: [Errno 2] No such file or directory: 'npz/layout/xla/default/test/'

### -- Layout / XLA / Default / Train

In [None]:
# Load the file names within the test data folder
test_data_files = os.listdir('data/tpugraphs/npz/layout/xla/default/train/')
test_data_df = pd.DataFrame(test_data_files, columns=['File Name'])

# Repeat the previous process, to understand all the arrays in the test data

for file in test_data_files:
    print(f"File Name: {file}")
    print("")
    xla_default_layout_train = np.load(f'data/tpugraphs/npz/layout/xla/default/train/{file}')
    print_array_info(xla_default_layout_test)
    print("")


File Name: brax_es.npz

 Array Name: node_feat
 Array Shape: (24790, 140)
 The node_feat array is 2D

 Array Name: node_opcode
 Array Shape: (24790,)
 The node_opcode array is 1D

 Array Name: edge_index
 Array Shape: (32709, 2)
 The edge_index array is 2D

 Array Name: node_config_feat
 Array Shape: (1000, 2002, 18)
 The node_config_feat array is 3D

 Array Name: node_config_ids
 Array Shape: (2002,)
 The node_config_ids array is 1D

 Array Name: config_runtime
 Array Shape: (1000,)
 The config_runtime array is 1D

 Array Name: node_splits
 Array Shape: (1, 482)
 The node_splits array is 2D


File Name: transformer.4x4.fp32.performance.npz

 Array Name: node_feat
 Array Shape: (24790, 140)
 The node_feat array is 2D

 Array Name: node_opcode
 Array Shape: (24790,)
 The node_opcode array is 1D

 Array Name: edge_index
 Array Shape: (32709, 2)
 The edge_index array is 2D

 Array Name: node_config_feat
 Array Shape: (1000, 2002, 18)
 The node_config_feat array is 3D

 Array Name: node_co

### -- Layout / XLA / Default / Valid

In [None]:
# Load the file names within the test data folder
test_data_files = os.listdir('data/tpugraphs/npz/layout/xla/default/valid/')
test_data_df = pd.DataFrame(test_data_files, columns=['File Name'])

# Repeat the previous process, to understand all the arrays in the test data

for file in test_data_files:
    print(f"File Name: {file}")
    print("")
    xla_default_layout_train = np.load(f'data/tpugraphs/npz/layout/xla/default/valid/{file}')
    print_array_info(xla_default_layout_test)
    print("")

File Name: mlperf_bert_batch_24_2x2.npz

 Array Name: node_feat
 Array Shape: (24790, 140)
 The node_feat array is 2D

 Array Name: node_opcode
 Array Shape: (24790,)
 The node_opcode array is 1D

 Array Name: edge_index
 Array Shape: (32709, 2)
 The edge_index array is 2D

 Array Name: node_config_feat
 Array Shape: (1000, 2002, 18)
 The node_config_feat array is 3D

 Array Name: node_config_ids
 Array Shape: (2002,)
 The node_config_ids array is 1D

 Array Name: config_runtime
 Array Shape: (1000,)
 The config_runtime array is 1D

 Array Name: node_splits
 Array Shape: (1, 482)
 The node_splits array is 2D


File Name: resnet50.4x4.fp16.npz

 Array Name: node_feat
 Array Shape: (24790, 140)
 The node_feat array is 2D

 Array Name: node_opcode
 Array Shape: (24790,)
 The node_opcode array is 1D

 Array Name: edge_index
 Array Shape: (32709, 2)
 The edge_index array is 2D

 Array Name: node_config_feat
 Array Shape: (1000, 2002, 18)
 The node_config_feat array is 3D

 Array Name: node_

---------------------------------------------------------------------------

## Analysis of the Data
### -- Layout / XLA / Default 

# 1) Test Data:

### Overview:

1. Number of Files: There are 8 .npz files in the dataset.
2. Shared Array Names Across Files: All files have arrays named:
- node_feat
- node_opcode
- edge_index
- node_config_feat
- node_config_ids
- config_runtime
- node_splits


### Specific Observations:

1. node_feat:
This seems to represent features associated with nodes in a graph. The size of the second dimension is consistently 140, suggesting that every node has 140 features.
The number of nodes (i.e., the first dimension) varies across the dataset, ranging from 490 to 43,615.

2. node_opcode:
This is likely an opcode (operation code) or a label associated with each node. It's a 1D array, and its length aligns with the number of nodes in the node_feat array.

3. edge_index:
This array possibly represents the source and destination indices of edges in the graph. Each row could denote an edge, with the two columns representing the source and target nodes, respectively.
The number of edges varies between files, from 749 to 73,881.

4. node_config_feat:
This array seems to contain configurations related to nodes. It is consistently sampled 1,000 times (first dimension), but the number of nodes (second dimension) and features per node (third dimension) varies. However, each configuration has 18 features (third dimension).

5. node_config_ids:
A 1D array that might represent unique identifiers or labels for node configurations. The length of this array matches the second dimension of the node_config_feat array.

6. config_runtime:
This is likely a measure of runtime for the configurations. Each file has data for 1,000 configurations.

7. node_splits:
Represents some split on the nodes. The meaning is not immediately clear, but it's noteworthy that the number of splits varies across files.

### Summary:
The datasets are graph-based with associated configurations and runtime measurements.

Nodes have features (node_feat) and opcodes (node_opcode), and there's information on edges connecting them (edge_index).

Nodes have configurations (node_config_feat) with associated IDs (node_config_ids) and runtimes (config_runtime).

There's some unknown split or categorization on nodes (node_splits).

### ChatGPT Recommendations for Further Analysis:

#### Visualize Graphs:

Plot some of the smaller graphs to visualize their structure. For larger graphs, consider plotting a subset or use algorithms to find a meaningful subgraph.
Examine the distribution of node degrees and other graph metrics.


#### Examine Node Features:

Analyze the distribution of values in the node_feat arrays.
Consider dimensionality reduction techniques (e.g., PCA) to visualize the high-dimensional node features in 2D or 3D.
Study Configuration Run Times:

Analyze the config_runtime array to see the distribution of runtimes across configurations.
Correlate node configurations (node_config_feat) with runtimes to determine if specific configurations lead to faster or slower runtimes.
Node Splits Investigation:

Investigate node_splits more deeply to understand its purpose and relevance.
General Data Exploration:

For each array, check for missing values or anomalies.
Consider normalization or standardization if values vary widely.
Data Correlation:

Explore if there's a correlation between node features and other attributes like configurations, opcodes, or runtime.

Remember, understanding domain-specific context and having access to experts or documentation can provide even more insight into the nature and importance of each dataset feature.




# 2. Train Data:


#### Analysis of Each Array:

1. *node_feat*:

Shape: (24790, 140)
It's a 2D array. Given the name, this is likely a feature representation of nodes. Each node has a feature vector of length 140, and there are 24790 such nodes.

2. *node_opcode*:

Shape: (24790,)
It's a 1D array. This could be an identifier or type for each node. Since it's 1D and has the same length as the first dimension of node_feat, each value likely corresponds to a specific node.

3. *edge_index*:

Shape: (32709, 2)
It's a 2D array. This suggests that there are 32709 edges. Each edge is described by a pair, possibly indicating a connection from one node to another (i.e., source and target nodes).

4. *node_config_feat*:

Shape: (1000, 2002, 18)
It's a 3D array. This is likely a feature representation for node configurations. There are 1000 such configurations, each having 2002 nodes, and each node in the configuration has 18 features.

5. *node_config_ids*:

Shape: (2002,)
It's a 1D array. This is probably an identifier or type for each node in the configuration.

6. *config_runtime*:

Shape: (1000,)
It's a 1D array. Given its name, this could represent some runtime values or metrics for each of the 1000 configurations.

7. *node_splits*:

Shape: (1, 482)
It's a 2D array. Not entirely clear from the name, but it might represent some segmentation or partitioning of nodes. There's only one such segmentation with 482 segments or sections.


#### Names of Files:
The filenames seem to be indicative of some machine learning architectures or models:

transformer.4x4.fp32
resnet50.8x8.fp32
bert_pretraining.8x16.fp16
transformer.2x2.fp32
mnasnet_a1_batch_128
resnet_v2_50_batch_16
mlperf_transformer

From the names, they might represent different configurations or versions of certain models, e.g., the 'fp32' or 'fp16' might indicate the precision of the data (32-bit floating point or 16-bit floating point).


---------------------------------------------------------------------------

# Our Approach:

## **Objective:**
The main goal is to predict the runtime of various AI models (represented as graphs) under different compiler configurations, with the end goal being finding the optimal configuration for each model to run as efficiently as possible.

## **Preliminary Analysis and Approach**:

1. **Graph Structure and Features**:  
    - AI models are represented as graphs, where nodes represent tensor operations and edges represent tensors.
    - `node_feat` and `node_opcode` likely represent features related to tensor operations.
    - `edge_index` is the representation of tensors or connections between tensor operations.

2. **Compiler Configurations**:
    - There are two main configurations: layout configuration and tile configuration.
    - `node_config_feat` with its 3D representation probably captures the essence of these configurations across different nodes.
    - `config_runtime` is crucial as it represents the runtime associated with these configurations - which is our main target to predict.

3. **ML Model Development**:
    - Given the nature of the data (graphs), Graph Neural Networks (GNNs) or similar models would be suitable for this task as they can capture the topology and features of graphs.
    - Input features would include node and edge features. The target variable would be `config_runtime`.
    - The model would be trained on the provided runtime data to understand the relationship between configurations and their resulting runtimes. The model can then predict runtimes for configurations in the test dataset.

4. **Evaluation and Optimization**:
    - Once the initial model is trained, it's important to evaluate its performance on a validation set. This will give insights into how well the model is likely to perform on unseen data.
    - Hyperparameter tuning, feature engineering, and other optimization techniques can be applied to improve the model's accuracy.

5. **Recommendation System**:
    - Based on the predicted runtimes, a recommendation system can be built to suggest the most optimal configuration for a given graph. The configuration that yields the lowest predicted runtime would be the recommended one.

6. **Handling Different Data Collections**:
    - The provided datasets (`layout:xla:random`, `layout:xla:default`, `layout:nlp:random`, `layout:nlp:default`, and `tile:xla`) might have subtle differences. It might be beneficial to train separate models for each collection or to include the collection type as an additional feature if using a unified model.

7. **Kaggle Competition**:
    - Remember to adhere to Kaggle's competition rules and submission guidelines. Make sure to preprocess and post-process the data as required by the competition.



In [None]:
# Load the file names within the test data folder
test_data_files = os.listdir('data/tpugraphs/npz/layout/nlp/default/test/')
test_data_df = pd.DataFrame(test_data_files, columns=['File Name'])

# Repeat the previous process, to understand all the arrays in the test data

for file in test_data_files:
    print(f"File Name: {file}")
    print("")
    xla_default_layout_train = np.load(f'data/tpugraphs/npz/layout/nlp/default/test/{file}')
    print_array_info(xla_default_layout_test)
    print("")

File Name: 171b0513d8874a427ccfa46d136fbadc.npz

 Array Name: node_feat
 Array Shape: (24790, 140)
 The node_feat array is 2D

 Array Name: node_opcode
 Array Shape: (24790,)
 The node_opcode array is 1D

 Array Name: edge_index
 Array Shape: (32709, 2)
 The edge_index array is 2D

 Array Name: node_config_feat
 Array Shape: (1000, 2002, 18)
 The node_config_feat array is 3D

 Array Name: node_config_ids
 Array Shape: (2002,)
 The node_config_ids array is 1D

 Array Name: config_runtime
 Array Shape: (1000,)
 The config_runtime array is 1D

 Array Name: node_splits
 Array Shape: (1, 482)
 The node_splits array is 2D


File Name: 7105451001e119f65b66570d170b94a8.npz

 Array Name: node_feat
 Array Shape: (24790, 140)
 The node_feat array is 2D

 Array Name: node_opcode
 Array Shape: (24790,)
 The node_opcode array is 1D

 Array Name: edge_index
 Array Shape: (32709, 2)
 The edge_index array is 2D

 Array Name: node_config_feat
 Array Shape: (1000, 2002, 18)
 The node_config_feat array is

## TensorBoard - Visualization of the Data

Using Matplotlib, we will be visualizing the data of the tensors, before we begin testing and making any further predictions. 


In [None]:
display(test_data_df)

Unnamed: 0,File Name
0,171b0513d8874a427ccfa46d136fbadc.npz
1,7105451001e119f65b66570d170b94a8.npz
2,3a0c5517a87df8d82fd637b83298a3ba.npz
3,016ac66a44a906a695afd2228509046a.npz
4,d15316c12eefdef1ba549eb433797f77.npz
5,6c1101f6231f4d1722c3b9f6d1e25026.npz
6,60880ed76de53f4d7a1b960b24f20f7d.npz
7,b2fdde3b72980907578648774101543e.npz
8,29886a50d55cfe77a9497bc906c76ce9.npz
9,32531d07a084b319dce484f53a4cf3fc.npz


In [None]:

import tensorboard 
from torch.utils.tensorboard import SummaryWriter



In [None]:
# Path: data_visualization.ipynb

# Visualize the data using TensorBoard
# We will be using the SummaryWriter class to write the data to TensorBoard
# And as of now, we will focus on the test data for our natural processing language models
# The nlp models are stored as files, with numerous features with varying dimensions

writer = SummaryWriter()

# Load the file name of the first file in the test data
test_data_files = os.listdir('data/tpugraphs/npz/layout/nlp/default/test/')
test_data_df = pd.DataFrame(test_data_files, columns=['File Name'])
display(test_data_files[0])
file_name = test_data_files[0]

# Load the data from the file
nlp_default_layout_test_file1 = np.load(f'data/tpugraphs/npz/layout/nlp/default/test/{file_name}')




'171b0513d8874a427ccfa46d136fbadc.npz'

In [None]:
print_array_info(nlp_default_layout_test_file1)


 Array Name: node_feat
 Array Shape: (10124, 140)
 The node_feat array is 2D

 Array Name: node_opcode
 Array Shape: (10124,)
 The node_opcode array is 1D

 Array Name: edge_index
 Array Shape: (16696, 2)
 The edge_index array is 2D

 Array Name: node_config_feat
 Array Shape: (1000, 344, 18)
 The node_config_feat array is 3D

 Array Name: node_config_ids
 Array Shape: (344,)
 The node_config_ids array is 1D

 Array Name: config_runtime
 Array Shape: (1000,)
 The config_runtime array is 1D

 Array Name: node_splits
 Array Shape: (1, 2)
 The node_splits array is 2D



In [None]:
# Load the data from the file
nlp_default_layout_test_file1 = np.load(f'data/tpugraphs/npz/layout/nlp/default/test/{file_name}')

node_feat = nlp_default_layout_test_file1['node_feat']

node_opcode = nlp_default_layout_test_file1['node_opcode']

edge_index = nlp_default_layout_test_file1['edge_index'] 

node_config_feat = nlp_default_layout_test_file1['node_config_feat']

node_config_ids = nlp_default_layout_test_file1['node_config_ids']

config_runtime = nlp_default_layout_test_file1['config_runtime']

node_splits = nlp_default_layout_test_file1['node_splits']




In [None]:
# Convert the numpy arrays to tensors
node_feat_tensor = torch.from_numpy(node_feat)

node_opcode_tensor = torch.from_numpy(node_opcode)

edge_index_tensor = torch.from_numpy(edge_index)

node_config_feat_tensor = torch.from_numpy(node_config_feat)

node_config_ids_tensor = torch.from_numpy(node_config_ids)

config_runtime_tensor = torch.from_numpy(config_runtime)

node_splits_tensor = torch.from_numpy(node_splits)







In [None]:
# Write the data to TensorBoard
writer.add_histogram('node_feat', node_feat_tensor)

writer.add_histogram('node_opcode', node_opcode_tensor)

writer.add_histogram('edge_index', edge_index_tensor)

writer.add_histogram('node_config_feat', node_config_feat_tensor)

writer.add_histogram('node_config_ids', node_config_ids_tensor)

writer.add_histogram('config_runtime', config_runtime_tensor)

writer.add_histogram('node_splits', node_splits_tensor)

writer.close()

In [None]:


for i in range(node_config_feat.shape[0]):
    # Write the data to TensorBoard
    writer.add_embedding(node_feat[i], metadata=node_opcode, tag='Node Features')
    writer.add_embedding(node_config_feat[i], metadata=node_config_ids, tag=f'Config {i}', global_step=i)
# embedd node_config_feat, 


In [None]:
import networkx as nx
import matplotlib.pyplot as plt


