# Groq API - Simple Multi-Chip Design

This tutorial demonstrates how to use the Groq API collectives for the Groq RealScale™ (chip-to-chip) interconnect to run a single program across multiple GroqCard accelerators.

By the end of this tutorial, you should feel comfortable with the following concepts:
* Program Contexts
* Groq RealScale™ interconnect between GroqCard accelerators
* Multi-Program Packages

Before starting, you should have read the Multi-Chip Design section of the Groq API Tutorial Guide.
This design performs a matmul on Device 0 and transmits the matmul results to Device 1, where a bias add is performed in the VXM. The final result is passed to the host.

## Build Your Program
Begin by importing the following packages. Since we'll be using the matmul component from the Neural Net library, we import groq.api.nn as well. 

In [None]:
import os
import shutil
import numpy as np
import groq.api as g
import groq.api.nn as nn
import groq.runner.tsp as tsp


print("Python packages imported successfully")

### Step 1: 
Instantiate a program package to store the programs. The program package name refers to the collection of programs run on the devices, and the package directory holds the files generated for the different GroqCard devices in the topology.

Note: If the package directory already contains IOP files, building your program fails with an error ("Compiling a program on top of an existing program directory"). Either delete the contents of the directory or use a new folder name.

In [None]:
pkg_name = "my_pkg"    # specify a name for your program package
pkg_dir = "./IOP"       # specify a directory for the different IOP files generated for each device to be placed. 

# Remove the package directory if it already exists so the build starts from a clean location
isdir = os.path.isdir(pkg_dir)
if isdir:
    shutil.rmtree(pkg_dir)

# Create Program Package
pgm_pkg = g.ProgramPackage(name=pkg_name, output_dir=pkg_dir)
print("Program package created: '"+pkg_name+"' at "+pkg_dir)
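The directory cleanup in Step 1 can also be wrapped in a small helper that reports what it removed, which helps when a stale build is the suspected cause of the "existing program directory" error. The following is a minimal sketch using only the Python standard library. The helper name `reset_package_dir` and the file name `stale.bin` are hypothetical and are not part of the Groq API.

```python
import os
import shutil
import tempfile


def reset_package_dir(path):
    """Remove `path` if it is an existing directory and return the file names it held."""
    if not os.path.isdir(path):
        return []
    removed = sorted(
        name
        for _root, _dirs, files in os.walk(path)
        for name in files
    )
    shutil.rmtree(path)
    return removed


# Demonstrate on a throwaway location so no real package directory is touched.
demo_root = tempfile.mkdtemp()
demo_dir = os.path.join(demo_root, "IOP")
os.makedirs(demo_dir)
with open(os.path.join(demo_dir, "stale.bin"), "w") as handle:  # hypothetical leftover file
    handle.write("left over from an earlier build")

removed_files = reset_package_dir(demo_dir)
still_exists = os.path.isdir(demo_dir)
print("Removed files:", removed_files)
print("Directory still exists:", still_exists)
shutil.rmtree(demo_root)
```

In the tutorial itself, the same helper would be called with `pkg_dir` before the program package is created.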

### Step 2:
Build your multi-chip program. This code describes the compute that each GroqCard accelerator performs.

In [None]:
def my_program(pgm_pkg, prog_name):
    # Set up the multi-chip topology: A1.4 GroqCard accelerators in a 4-way connection
    topo = g.configure_topology(config=g.TopologyConfig.DF_A14_4_CHIP, speed=25.78125)
    print("Building multi-chip program " +prog_name+" with " +topo.name+" topology ...")
    
    # Create a new program context.
    pg_ctx = pgm_pkg.create_program_context(prog_name, topo)

    with pg_ctx:
        shape = (320, 320)  # Define our tensor shape
        dtype = g.float16   # Define the desired data type

        # We'll begin by specifying the compute we want to take place on Device 0 in the topology
        with g.device(0):
            matrix1 = g.input_tensor(
                shape, dtype, name="inp_a", layout="H1(W), -1, S2"
            )
            matrix2 = g.input_tensor(
                shape, dtype, name="inp_b", layout="H1(W), -1, S16(4-38)"
            )
            mm = nn.MatMul(name="MyMatMul")

            with g.ResourceScope(name="mmscope", is_buffered=True, time=0) as mmscope :
                result_mt = mm(matrix1, matrix2, time=0).write(
                    name="mm_result", layout="H1(W), -1, S4"
                )
                g.add_mem_constraints([matrix1, matrix2], [result_mt], g.MemConstraintType.BANK_EXCLUSIVE)
            # The following resource scope will use the C2C (chip-to-chip) Broadcast collective to send the results from the matmul 
            # to Device 1. We could also add more devices in the devices list if we wanted to share the results with more devices. 
            with g.ResourceScope(
                name="broadcast", is_buffered=True, time=None, predecessors=[mmscope]
            ) as bcastscope:
                received_mmt = g.c2c_broadcast(
                    result_mt, devices=[g.device(1)], time=0
                )
                # The broadcast op will return a list of memory tensors that can be used to access the data in Device 1
        
        # Now, let's specify the compute we want to take place on Device 1 in the topology
        with g.device(1):
            bias_mt = g.input_tensor(shape, dtype=g.float32, name="bias")

            # We use a buffered resource scope to ensure that the bias add isn't applied until Device 1 has received the results of the matmul from Device 0. 
            with g.ResourceScope(
                name="biasscope", is_buffered=True, time=None, predecessors=[mmscope, bcastscope]
            ) as biasscope:
                result_st = received_mmt[0].add(bias_mt, time=0)    # received_mmt is the returned list of memory tensors from the broadcast op. 
                result_mt = result_st.write(name="result")
            result_mt.set_program_output()      # This sets the program output to be the final result, thereby returning this value to the host. 

        return 

prog_name = "realscale_program"              # Give your program a name
my_program(pgm_pkg, prog_name)       # Instantiate your program passing the program package you created earlier and the name of your program
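The three resource scopes in Step 2 are chained through their `predecessors` arguments. The sketch below models only that dependency graph, using the scope names from the program and the standard-library `graphlib` module (Python 3.9+). It is an illustration of the ordering the predecessor lists imply, not a description of how the Groq compiler schedules the scopes.

```python
from graphlib import TopologicalSorter

# Each scope maps to the scopes listed in its `predecessors` argument above.
scope_predecessors = {
    "mmscope": [],                          # matmul on Device 0
    "broadcast": ["mmscope"],               # C2C broadcast from Device 0 to Device 1
    "biasscope": ["mmscope", "broadcast"],  # bias add on Device 1
}

# static_order() yields every scope after all of its predecessors.
scope_order = list(TopologicalSorter(scope_predecessors).static_order())
print(" -> ".join(scope_order))
```

Because each scope depends on the one before it, only one ordering is possible: the matmul, then the broadcast, then the bias add.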

### Step 3:
Assemble all programs in the multi-device package. In this example, we have one program that uses 2 devices in the topology. However, you could have multiple programs using different program contexts. Regardless, once you've created your program, you call assemble() on the package (pgm_pkg.assemble() here) to add the program to the package. When you add a program, it compiles the previous program, checking that there are no conflicts in the resources allocated. The *.aa and IOP files are generated during this step. After running the following cell, you can click into the /IOP folder to see the generated files.

In [None]:
pgm_pkg.assemble()
print("Assembled multi-device package "+pkg_name)
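As an alternative to browsing the package folder by hand, the generated files can be listed from Python. The following is a rough sketch using only the standard library. The helper name `files_by_extension` is hypothetical, and the demo uses placeholder file names in a temporary directory, since the exact names the assembler produces are not listed in this tutorial.

```python
import os
import tempfile
from collections import defaultdict


def files_by_extension(directory):
    """Group every file under `directory` by extension, using paths relative to it."""
    grouped = defaultdict(list)
    for root, _dirs, files in os.walk(directory):
        for name in files:
            extension = os.path.splitext(name)[1] or "(none)"
            relative = os.path.relpath(os.path.join(root, name), directory)
            grouped[extension].append(relative)
    return {ext: sorted(paths) for ext, paths in sorted(grouped.items())}


# Demo with placeholder files; these names are illustrative only.
with tempfile.TemporaryDirectory() as demo_dir:
    for placeholder in ("first.txt", "second.txt", "notes"):
        with open(os.path.join(demo_dir, placeholder), "w") as handle:
            handle.write("placeholder")
    demo_listing = files_by_extension(demo_dir)

for extension, paths in demo_listing.items():
    print(extension, paths)
```

After `pgm_pkg.assemble()` has run, calling `files_by_extension(pkg_dir)` would give the same kind of summary for the real package directory.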

### Step 4 (Optional): Bring Up the Groq RealScale Links

This step is only needed if the Groq RealScale™ links are down. It wakes the links up and gets the GroqCard accelerator devices ready to run your program. 

In [None]:
tsp.bringup_topology(user_config=g.TopologyConfig.DF_A14_4_CHIP, speed=25)
#print("Bringup of Groq RealScale topology completed successfully.")

## Run on Hardware
Program all the GroqChip devices in the topology with their respective binary files.

### Step 5:

Since we're creating a multi-device program, we're going to use the multi-TSP runner to load the program onto the GroqCard devices.

Note: Device 0 in the topology is the first device in the list. If your devices are named groqA0-A3, Device 0 will be groqA0, Device 1 will be groqA1, and so on. This topology is referred to as instance 0 or pool 0. Similarly, another 4-way topology can be found on groqA4-A7, which is instance 1 or pool 1. If you have 8 GroqCards (A0-A7), you have two 4-way topologies (one with A0-A3 and one with A4-A7) that will run the same compute on both sets of cards. If the hardware platform supports more than one such pool/instance, create_multi_tsp_runner will look up and return the first available pool/instance for the program execution.

In [None]:
runner = tsp.create_multi_tsp_runner(
    pkg_name, pkg_dir, prog_name, user_config=g.TopologyConfig.DF_A14_4_CHIP, speed=25
)
print("Multi-TSP Runner created successfully.")
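The device numbering described in the note for this step can be sketched in plain Python. The helper `pool_layout` below is hypothetical and does not query any hardware. It assumes devices are listed in order and grouped into consecutive pools of four, and that the first device in each pool is that pool's Device 0, as the note describes for groqA0-A3.

```python
def pool_layout(device_names, pool_size=4):
    """Split an ordered device list into consecutive pools of `pool_size` devices."""
    if len(device_names) % pool_size != 0:
        raise ValueError("device count must be a multiple of the pool size")
    return [
        device_names[start:start + pool_size]
        for start in range(0, len(device_names), pool_size)
    ]


devices = [f"groqA{index}" for index in range(8)]  # groqA0 .. groqA7
pools = pool_layout(devices)

for pool_id, members in enumerate(pools):
    for topology_id, name in enumerate(members):
        print(f"pool {pool_id}: Device {topology_id} is {name}")
```

With eight cards this gives pool 0 on groqA0-A3 and pool 1 on groqA4-A7, matching the two 4-way topologies in the note.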

### Step 6: 
Pass inputs to the runner and execute the program on hardware. For multi-chip programs, the inputs are expected in a Python dict keyed by tensor name: each key is the name given to an input tensor when it was allocated (here inp_a, inp_b, and bias), and each value is the data for that tensor.

In [None]:
t1_data = np.random.rand(320, 320).astype(np.float16)  # matrix1 data
t2_data = np.random.rand(320, 320).astype(np.float16)  # matrix2 data
bias_data = np.random.random_sample(size=(320, 320)).astype(np.float32)  # bias add data, float32 to match the MXM result for FP16 matmul


input_data = {'inp_a': t1_data, 'inp_b': t2_data, 'bias': bias_data}
print("Executing program " +prog_name)
results = runner(**input_data)
print("Results are in!")
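Because the inputs are matched to tensors by name, a misspelled key or a wrongly shaped array is easier to diagnose before the program is launched. The following is a minimal sketch of such a check. The helper `check_inputs` and the `EXPECTED_INPUTS` table are hypothetical, with names and shapes copied from the program in Step 2. The demo uses `SimpleNamespace` stand-ins so that it runs without NumPy, but any object with a `.shape` attribute (such as a NumPy array) would work.

```python
from types import SimpleNamespace

# Input tensor names and shapes as declared in the program in Step 2.
EXPECTED_INPUTS = {"inp_a": (320, 320), "inp_b": (320, 320), "bias": (320, 320)}


def check_inputs(inputs, expected=EXPECTED_INPUTS):
    """Return a list of human-readable problems; an empty list means the dict looks right."""
    problems = []
    for name in sorted(set(expected) - set(inputs)):
        problems.append(f"missing input '{name}'")
    for name in sorted(set(inputs) - set(expected)):
        problems.append(f"unexpected input '{name}'")
    for name in sorted(set(inputs) & set(expected)):
        shape = tuple(inputs[name].shape)
        if shape != expected[name]:
            problems.append(f"'{name}' has shape {shape}, expected {expected[name]}")
    return problems


# Stand-in objects that only carry a shape.
good_inputs = {name: SimpleNamespace(shape=shape) for name, shape in EXPECTED_INPUTS.items()}
bad_inputs = {
    "inp_a": SimpleNamespace(shape=(320, 320)),
    "inp_B": SimpleNamespace(shape=(320, 320)),  # wrong capitalization
    "bias": SimpleNamespace(shape=(320,)),       # wrong shape
}

good_problems = check_inputs(good_inputs)
bad_problems = check_inputs(bad_inputs)
print(good_problems)
for problem in bad_problems:
    print(problem)
```

In the tutorial, `check_inputs(input_data)` could be called just before `runner(**input_data)`.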

## Check Results

We'll use NumPy to compute a reference result and compare it with the results we received from Groq hardware.

In [None]:
oracle = np.matmul(t1_data, t2_data.transpose(), dtype=np.float32) + bias_data
print("For input tensors of size {} x {}. Results are: ".format(t1_data.shape, t2_data.shape))
print(np.allclose(oracle, results['result'], rtol=1e-1, atol=1e-1, equal_nan=True))
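To make the check above concrete, the sketch below reproduces it on a 2 x 2 example in plain Python. It computes the same reference as the oracle (the first input times the transpose of the second, plus the bias) and applies the element-wise test that NumPy documents for `allclose`, namely `|a - b| <= atol + rtol * |b|`. The helper names and the "device" values are made up for illustration, and NaN handling (`equal_nan`) is not modeled.

```python
def matmul_transposed(a, b):
    """Return a @ b.T for matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b] for row_a in a]


def add_elementwise(a, b):
    """Return the element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def all_close(a, b, rtol=1e-1, atol=1e-1):
    """Apply |a - b| <= atol + rtol * |b| to every pair of elements."""
    return all(
        abs(x - y) <= atol + rtol * abs(y)
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    )


inp_a = [[1.0, 2.0], [3.0, 4.0]]
inp_b = [[5.0, 6.0], [7.0, 8.0]]
bias = [[0.5, 0.5], [0.5, 0.5]]
oracle = add_elementwise(matmul_transposed(inp_a, inp_b), bias)

# Made-up "device" results: one within tolerance, one with a clearly wrong element.
close_result = [[17.4, 23.6], [39.5, 53.5]]
wrong_result = [[17.4, 23.6], [39.5, 60.0]]

print(oracle)
print(all_close(oracle, close_result))
print(all_close(oracle, wrong_result))
```

Note that the relative term scales with the second argument, which in the tutorial's call is the hardware result rather than the oracle.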

### Follow-On Learning Challenge:

Try using more devices in the topology and one of the other chip-to-chip collectives, such as Scatter or Gather. 