In [1]:
import numpy as np
import pandas as pd
import glob

Schematic: Proposed Flow: Start with Layout - Capture Node Features , their opcodes, their edge connectivity  -> concatenate to supply as input to GNN -> get a cumulative runtime for reconfigurable nodes.

What Layout has: Node dependencies , Opcode for each node to note its functionality, Node features which are its structures with inpute and output layer configuration, IDs of Nodes which can be configured, features of the nodes which can be experimented with and altered, configuration runtimes to denote the runtime taken for each alteration.

What Tile has: Node dependencies , Opcode of each node in a fused subgraph to note their functionality, Rest the same with differences between runtime normalizers as we are observing this at kernel level. Config Features have an extra sum/product differentiators.

How this data is arranged: Layout has each of alexnet, bert, audio, video data in XLA. Tile has a follow up of these features, different tiles arranged sequentially. To process this, use regex to first optimize tile features. 

Will we use GCN or someother architecture?
GCN would definitely facilitate, provided we make the model predictions variable. Assign data classes to be equal to number of node features. Produce a mathematical equation that computes the runtime, i.e, integrates the multiple dependency runtimes. In channels also should be 140.

Approach: Embed the node opcodes with its features which will be passed as input. Get edge conditions, i.e, node connections and flow.

Figure out: What determines the runtime of each configuration. In other words, output how many times each Opcode's node was used, alongwith info of the connections it has in the value the output node would carry. 
Example: If Addition node was used 10 times, and is connected with max, mul, let it have the value 10+(number of nodes with similar dimension values)*0.5 for max+ (number of nodes with similar dimension values)*2. Having a rough estimate of runtime and a statistial parameter that relates to the actual and predicted value will help.
Details on embedding runtime and configuration features: We need an intermediate layer of gcn with 18 input nodes and 1 output node. This embedding should have the opcode of the particular configurable node i.

Algorithm:
 1. Determine the runtime of the operation associated with each opcode.
 2. Multiply 1/100 * opcode * runtime. Name this result oper_run_time.
 3. Define GRU with 140 input nodes, 18 intermediate nodes and 1 output node.
 4. Multiply oper_run_time * node_feat to produce consolidated_feat.
 5. For each connected edge j of node i:
          set inp_cons = 2 if input is being consumed 1 if it is being given out.
          consolidated_feat[i] = consolidated_feat[i]+ inp_cons * consolidated_feat[j] 
          i=i+1
 6. Add all of the consolidated_feat vectors.
 7. Check in config feat if this node exists and set consolidated_feat[nfi]+=1 else consolidated_feat [nfi] = consolidated_feat[nfi]
 8. Pass consolidated_feat[nfi] as input to each input gcn node gi = nfi.
 9. Let intermediate layer 1 have the summation function.
 10. For each configurable node ci, pass consolidated_feat[ci] * node_config_feat[i] + output of the summation function for each configuration iteration.
 11. Train the model with runtime/runtime_normalizers of each configuration as output.




In [2]:
pip install torch

Note: you may need to restart the kernel to use updated packages.


In [3]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

In [4]:
class GRURuntimePredictor(nn.Module):
    def __init__(self, input_dim1=140, hidden_dim1=18, gru_input_dim=18, gru_hidden_dim=32, output_dim=1):
        super(GRURuntimePredictor, self).__init__()
        self.fc1 = nn.Linear(input_dim1, hidden_dim1)
        self.gru = nn.GRU(input_size=gru_input_dim, hidden_size=gru_hidden_dim, batch_first=True)
        self.fc2 = nn.Linear(gru_hidden_dim, output_dim)

    def forward(self,x,y):
        x_gru_input = torch.relu(x) 
        x_gru_input= x_gru_input.unsqueeze(-1)
        y_gru_input = torch.relu(y) 
        cons_gru_input= x_gru_input*y_gru_input
        gru_output, _ = self.gru(cons_gru_input)
        final_output = self.fc2(gru_output) 
        pred_output = torch.sum(final_output)
        pred_output= pred_output.unsqueeze(-1)
        return pred_output

In [5]:
def runtime_of_operations_from_opcode(Opcode):
    runtimes = [ 3,4,3,4, 1, 6, 9, 6, 6, 2, 6,3, 7, 2, 2, 7, 1, 7, 2,3,8,8,
     8,    2,    5,    5,    6,    1,    9,    2,    4,   14,   21,
    43,   34,   12,   40,   41,   43,   49,   42,   28,   36,   30,
    24,   16,   36,   24,   26,   33,   28,   10,   36,   15,   36,
    11,   10,   17,   15,   31,   16,  305,  219,  160,  243,  214,
   422,  223,  122,  198,   97,  421,  100,  229,  435,  241,  482,
   417,  241,  224,  176,  164,  256,   63,  188,  152,  461,  185,
   297,  167,   60,  782, 4802, 4935, 3594, 3657,  613, 3482, 2442,
  3008, 2300, 4915, 1195, 1717, 4389, 4474, 4299, 4373,  946, 1662,
  4832, 2585, 4271, 4712, 3309,  589, 2386, 3385, 2077, 4332]
    return runtimes[Opcode-1]

In [6]:
d1= dict(np.load("/kaggle/input/predict-ai-model-runtime/npz_all/npz/layout/xla/default/train/alexnet_train_batch_32.npz"))

In [7]:
def opcode_based_runtime_pred(d_xla,num_nodes):
    for i in range(num_nodes):
        oper_run_time = runtime_of_operations_from_opcode(d_xla["node_opcode"][i])
        d_xla["node_feat"][i] = 0.0000001 * d_xla["node_opcode"][i]* oper_run_time * d_xla["node_feat"][i]

In [8]:
def find_row(matrix,target):
    return [i for i, row in enumerate(matrix) if target in row]

In [9]:
def edge_dependency_embedding(d_xla,num_nodes):
    for i in range(num_nodes):
        occur_node = find_row(d_xla["edge_index"],i)
        if len(occur_node)>0:
            for j in range(len(occur_node)):
                if d_xla["edge_index"][j][0] == i:
                    d_xla["node_feat"][i] = (d_xla["node_feat"][d_xla["edge_index"][j][1]]*2)+d_xla["node_feat"][i]
                elif d_xla["edge_index"][j][1] == i:
                    d_xla["node_feat"][i] = d_xla["node_feat"][d_xla["edge_index"][j][0]]+d_xla["node_feat"][i]

In [10]:
def check_if_configurable(d_xla):
    for i in d_xla["node_config_ids"]:
        d_xla["node_feat"][i]+=1

In [11]:
def generate_consolidated_node_features(d_xla,num_nodes):
    consolidated_node_feat= []
    for i in range(140):
        cons_feat=0
        for j in range(num_nodes):
            cons_feat+=d_xla["node_feat"][j][i]
        consolidated_node_feat.append(cons_feat)
    consolidated_node_feat=torch.tensor(consolidated_node_feat)
    consolidated_node_feat=consolidated_node_feat.float()
    return consolidated_node_feat

In [12]:
def configuration_convolution(d_xla,num_configurable_nodes):
    node_config_feat = []
    for i in range(num_configurable_nodes):
        node_config_cons_feat = []
        cons_feat=0
        for j in range(18):
            for k in range(num_configurable_nodes):
                cons_feat+=(d_xla["node_feat"][d_xla["node_config_ids"][k]])*(d_xla["node_config_feat"][i][k][j])
            node_config_cons_feat.append(cons_feat)
        node_config_feat.append(node_config_cons_feat) 
    return node_config_feat    

In [13]:
def data_transformation(d_xla,num_nodes,num_configurable_nodes):
    node_config_feat_trans = []
    config_runtime_trans = []
    opcode_based_runtime_pred(d_xla,num_nodes)
    edge_dependency_embedding(d_xla,num_nodes)
    check_if_configurable(d_xla)
    cons_node_feat = generate_consolidated_node_features(d_xla,num_nodes)
    node_config_feature = configuration_convolution(d_xla,num_configurable_nodes)
    for i in range(num_configurable_nodes):
        node_config_feat_tensor=torch.tensor(node_config_feature[i])
        node_config_feat_tensor=torch.t(node_config_feat_tensor)
        config_runtime_tensor=torch.tensor(0.0001* d_xla["config_runtime"][i]).float()
        config_runtime_tensor=config_runtime_tensor.unsqueeze(-1)
        node_config_feat_trans.append(node_config_feat_tensor)
        config_runtime_trans.append(config_runtime_tensor)
    return cons_node_feat,node_config_feat_trans,config_runtime_trans

In [14]:
def model_training(d_xla,num_nodes,num_configurable_nodes,model):
    optimizer = torch.optim.Adam(model.parameters(),lr=0.8)
    criterion = torch.nn.MSELoss()
    cons_node_feat,node_config_feat_tens,config_runtime_tens = data_transformation(d_xla,num_nodes,num_configurable_nodes)
    cons_loss = 1
    for i in range(num_configurable_nodes):
        pred1 = model(cons_node_feat,node_config_feat_tens[i])
        loss1 = criterion(pred1,config_runtime_tens[i]) * 0.00000001
        optimizer.zero_grad()
        loss1.backward()
        optimizer.step()
        cons_loss+=loss1
    return cons_loss

In [15]:
model = GRURuntimePredictor()
xla_dataset = glob.glob("/kaggle/input/predict-ai-model-runtime/npz_all/npz/layout/xla/default/train/*.npz")

In [16]:
len_dataset = len(xla_dataset)
cons_train_loss = 0
train_loss_per_iter = 100

In [17]:
for i in range(len_dataset):
    xla_dataset_dict = dict(np.load(xla_dataset[i]))
    node_count = len(xla_dataset_dict["node_opcode"])
    config_count = len(xla_dataset_dict["node_config_ids"])
    loss_train = model_training(xla_dataset_dict,node_count,config_count,model)
    print("loss train"+ str(loss_train))

  node_config_feat_tensor=torch.tensor(node_config_feature[i])


loss traintensor(11.2404, grad_fn=<AddBackward0>)
loss traintensor(58602.9492, grad_fn=<AddBackward0>)
loss traintensor(1796039., grad_fn=<AddBackward0>)
loss traintensor(2189371., grad_fn=<AddBackward0>)


KeyboardInterrupt: 