# Groq API - Fibonacci Tutorial
The following tutorial uses the Fibonacci sequence to demonstrate how program contexts can be used to create coresident models. To do this, multiple programs are compiled together and then loaded into the GroqChip memory.

By the end of this tutorial, you should be familiar with the following concepts:
* Program Contexts
* Shared Tensors / Storage Requests
* Input/Output Program (IOP) Files

It is expected that you have finished reading the <b>Coresident Models</b> section of the Groq API Tutorial Guide prior to going through this tutorial.

## Fibonacci Series Refresher
`t = a1 + a2`

`a1 = a2`

`a2 = t`
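
As a quick host-side illustration of the three assignments above, the following minimal sketch applies them repeatedly and records `t` after each step. The function name `fibonacci_trace` and its default starting values are illustrative assumptions, not part of the Groq API.

```python
def fibonacci_trace(steps, a1=0, a2=1):
    """Return the value of t after each of `steps` updates (plain Python, no GroqChip)."""
    values = []
    for _ in range(steps):
        t = a1 + a2   # t = a1 + a2
        a1 = a2       # a1 = a2
        a2 = t        # a2 = t
        values.append(t)
    return values


print(fibonacci_trace(6))
```

Starting from `a1 = 0` and `a2 = 1`, six updates yield 1, 2, 3, 5, 8, 13.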

<b>Note:</b> for this version of the program, we will use scalars to keep the computation simple.
We'll break the implementation into three programs (using program contexts), as outlined below:

## GroqAPI Implementation
- <b>Initialize:</b> The first program context receives two input values from the host and stores them in memory. The other programs use this memory location.
- <b>Iterate:</b> The second program context references the memory locations where the input data is stored, computes a single iteration of the Fibonacci sequence, and updates the contents in memory.
- <b>Offload:</b> The final program context returns the value from the shared memory location where the most recent iteration of the Fibonacci sequence is stored.

While this example may seem trivial, it demonstrates that you can run a program N times on the GroqChip before unloading the results to the host.
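
To make the division of labor concrete before touching the Groq API, here is a minimal pure-Python analogy. The `SharedMemory` class and the `init`, `iterate`, `offload`, and `run` functions are hypothetical stand-ins: a dictionary plays the role of GroqChip memory that persists between program invocations. It is a sketch of the idea only and does not reflect how the hardware stores data.

```python
class SharedMemory:
    """Stand-in for on-chip memory that persists between program invocations."""

    def __init__(self):
        self.cells = {}


def init(mem, a1, a2):
    # Host -> chip: store the two starting values once.
    mem.cells["a1"] = a1
    mem.cells["a2"] = a2


def iterate(mem):
    # One Fibonacci step that only touches shared state (no host traffic).
    t = mem.cells["a1"] + mem.cells["a2"]
    mem.cells["a1"] = mem.cells["a2"]
    mem.cells["a2"] = t
    mem.cells["t"] = t


def offload(mem):
    # Chip -> host: return the most recent value of t.
    return mem.cells["t"]


def run(n_iterations, a1=0, a2=1):
    """Init once, iterate n times (n >= 1), offload once."""
    mem = SharedMemory()
    init(mem, a1, a2)
    for _ in range(n_iterations):
        iterate(mem)
    return offload(mem)


print(run(6))
```

The host only exchanges data with the "chip" in `init` and `offload`; every `iterate` call works on values that are already resident.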

## Build Your Program
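Begin by importing the Groq API, along with the standard-library modules used for package directory management and type hints. (NumPy is imported later, in the runtime portion of this tutorial.)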

In [None]:
import os  # Used for package directory management
import shutil

import groq.api as g
from typing import Dict, Optional
print("Python packages imported successfully")

## Program Definition

This example includes multiple program contexts (i.e. Initialize, Iterate, and Offload). These program contexts are packaged together (within a <b>program package</b>) to run on a single GroqCard accelerator. As such, we'll configure our API program to use a single card topology.

First, we'll create a program context and pass in "init" as our first program name. This will allow us to define the compute graph for the initialization, which will receive two input values from the host and save them to memory.

Second, we'll create a program context for the "iterate" program, i.e. our second program context. This program is our longest as it will compute the addition of the two values saved in the memory location during the init program. It will also update the values in memory.

Lastly, we'll create one more program context for the "offload" program that will return the result from GroqChip memory to the host.

In [None]:
class IopFileNames:
    INIT = "init"
    ITERATE = "iterate"
    OFFLOAD = "offload"

def specify_program(program_package: g.ProgramPackage, prog_name: str):
    topo = g.configure_topology(config=g.TopologyConfig.DF_A14_1_CHIP)
    print(f"Building single-chip program {prog_name} with {topo.name} topology.")
    
    # Create a new program context for the INIT program
    init_ctx = program_package.create_program_context(IopFileNames.INIT, topo)

    with init_ctx:
        shape = (1,)           # Define our tensor shape, in this case a scalar
        dtype = g.float16      # Define the desired data type

        # layout = west hemisphere, 2 slices for FP16, allocate the first 2 slices for the first scalar and the subsequent two slices for the second input
        a1_mt = g.input_tensor(shape, dtype, name="a1_input", layout="H1(W), -1, S2(0-1)")
        a2_mt = g.input_tensor(shape, dtype, name="a2_input", layout="H1(W), -1, S2(2-3)")

        a1_mt.is_static = True
        a2_mt.is_static = True

    # Next, we create the second program context for the ITERATE program
    iterate_ctx = program_package.create_program_context(IopFileNames.ITERATE, topo)

    with iterate_ctx:

        # Create shared memory tensors that reference the tensors from the init program
        # This copies a tensor's memory allocation from the init program context into the iterate_ctx
        shared_a1_mt = g.shared_memory_tensor(
            mem_tensor=a1_mt, name="shared_a1"
        )
        shared_a2_mt = g.shared_memory_tensor(
            mem_tensor=a2_mt, name="shared_a2"
        )
        
        # We use buffered resource scopes to easily schedule one operation after the other
        # The first resource scope will compute t = a1 + a2
        with g.ResourceScope(name="add", is_buffered=True, time=0) as add_scope:
            t_st = shared_a1_mt.add(shared_a2_mt, time=0) # Use the shared memory tensors since that's where the input values were stored
            t_mt = t_st.write(name="t_mt")
            t_mt.is_static = True
            
        # In the second resource scope, we'll update the memory location of a1 to equal the value of a2
        with g.ResourceScope(
            name="shuffle", is_buffered=True, time=None, predecessors=[add_scope]
        ) as shuffle_scope:
            # Read value from a2 memory location
            a1_st = shared_a2_mt.read(streams=g.SG1[0], time=0)
            # Write the a2 value to the a1 memory location
            a1_mt = a1_st.write()
            a1_mt.storage_request = shared_a1_mt.storage_request
            
        # Next, we update a2 to equal the value of t
        with g.ResourceScope(
            name="shuffle2",
            is_buffered=True,
            time=None,
            predecessors=[shuffle_scope],
        ) as shuffle_scope2:
            # Read from memory
            a2_st = t_mt.read(streams=g.SG1[0], time=0)
            # Write to memory
            a2_mt = a2_st.write()
            a2_mt.storage_request = shared_a2_mt.storage_request

    # Lastly, we create the OFFLOAD program that will read the final value of `t` from memory and return the result to the host, thereby "offloading" the final value. 
    offload_ctx = program_package.create_program_context(IopFileNames.OFFLOAD, topo)

    with offload_ctx:
        # Create a shared memory tensor referring to the t value in memory.
        result_mt = g.shared_memory_tensor(
            mem_tensor=t_mt, name="result_out"
        )
        # Mark as program output to be returned to host
        result_mt.set_program_output()

    # Compile the last program context
    program_package.compile_program_context(offload_ctx)

With the three programs defined in our `specify_program` function, we can call it, passing in the program package and the program name.

In [None]:
pkg_name = "fibonacci_package"  # Specify a name for your program package
pkg_dir = "./IOP"  # Specify a directory for the different IOP files generated for each device to be placed.

# Remove any existing package directory so the build starts from a clean location
isdir = os.path.isdir(pkg_dir)
if isdir:
    shutil.rmtree(pkg_dir)

# Create Program Package
program_package = g.ProgramPackage(name=pkg_name, output_dir=pkg_dir)
print(f"Program package created: {pkg_name} at {pkg_dir}")

prog_name = "fibonacci_program"

# Call the function we defined in the previous step
specify_program(program_package, prog_name)

# The next step will generate a single IOP file that contains the 3 programs we defined.
program_package.assemble()
print(f"Assembled package {pkg_name}")

With our IOP file generated, let's take a look at what was actually created. To do this, we'll use the `iop-utils` tool that is included in the GroqWare Suite. This tool can be used to look at more details about the IOP file. For background, a single program in an IOP file includes the following entry points into the program:

0) Monolithic - Comprised of the next 3 entry points, i.e. Input, Compute, and Output.

1) Input - Loads inputs onto the GroqChip

2) Compute - Executes the program instructions in the compute graph

3) Output - Unloads the results from the GroqChip

In the IOP file we just generated, you should see three programs (init, iterate, offload) and each program should have four entry points (Mono, Input, Compute, Output). This is a great way to check that your programs are packaged correctly with the inputs and outputs defined as expected.


In [None]:
# !iop-utils io IOP/fibonacci_package.0.iop

While looking at the IOP file, let's create a class to define the entry points. We'll use this when we execute our programs. 

In [None]:
class EntryPoint: 
    MONO = 0
    INPUT = 1
    COMPUTE = 2
    OUTPUT = 3
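
As a design alternative, the same constants could be expressed with `enum.IntEnum`, which keeps the integer values but adds readable names when printing or debugging. This is a sketch only: the class name `EntryPointEnum` and the `fake_entry_points` list are hypothetical, and it assumes the entry-point container accepts any int-like index the way a Python list does.

```python
from enum import IntEnum


class EntryPointEnum(IntEnum):
    MONO = 0
    INPUT = 1
    COMPUTE = 2
    OUTPUT = 3


# Stand-in for a program's entry-point list; a real one comes from the IOP file.
fake_entry_points = ["mono", "input", "compute", "output"]

# IntEnum members are ints, so they can index a list directly.
selected = fake_entry_points[EntryPointEnum.COMPUTE]

# Going the other way: recover the name from a raw entry-point number.
label = EntryPointEnum(3).name

print(selected, label)
```

The plain class used in this tutorial works equally well; the enum variant mainly helps when an entry-point number shows up in a log and you want its name.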

# Running on Hardware
While the following code is included in a single notebook for user ease, in reality, the runtime execution portion is separate from the compilation portion. As such, we show all the code necessary for runtime execution as well, including specifying the imports needed.

In [None]:
from typing import Dict, Optional

import numpy as np
import groq.api as g
import groq.runtime.driver as runtime

### Define a Function to Invoke Our Program Package

The following function demonstrates how to invoke a program on the GroqCard by specifying the program package and the entry point. For this function, we'll specify the following:
* Device (i.e. the physical card we want the program to execute on), 
* The program to run (init, iterate, or offload), 
* The entry point (monolithic for this example, but it is also possible to call specific entry points), 
* Any input tensors (Dict) to pass from the host to the GroqCard (only needed for the init program)


In [None]:
def invoke(device, program, entry_point, tensors: Optional[Dict[str, np.ndarray]]=None):
    ep = program.entry_points[entry_point]
    input_buffer = runtime.BufferArray(ep.input, 1)[0]
    output_buffer = runtime.BufferArray(ep.output, 1)[0]
    if ep.input.tensors:
        for input_tensor in ep.input.tensors:
            if tensors is None or input_tensor.name not in tensors:
                raise ValueError(
                    f"Missing input tensor named {input_tensor.name}")
            input_tensor.from_host(tensors[input_tensor.name], input_buffer)
    device.invoke(input_buffer, output_buffer)
    outs = {}
    if ep.output.tensors:
        for output_tensor in ep.output.tensors:
            result_tensor = output_tensor.allocate_numpy_array()
            output_tensor.to_host(output_buffer, result_tensor)
            outs[output_tensor.name] = result_tensor
    return outs
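
The input check inside `invoke` stops at the first missing tensor. If you would rather see every missing name at once before touching the device, a small host-side helper along these lines could be called first. The name `missing_inputs` is a hypothetical illustration and not part of the Groq runtime; it only compares names and never inspects tensor contents.

```python
from typing import Dict, Iterable, List, Optional


def missing_inputs(
    required: Iterable[str], tensors: Optional[Dict[str, object]]
) -> List[str]:
    """Return the required tensor names that are absent from `tensors`."""
    provided = tensors or {}  # Treat None the same as "no tensors supplied"
    return [name for name in required if name not in provided]


# The init program in this tutorial expects these two input names.
required_names = ["a1_input", "a2_input"]
print(missing_inputs(required_names, {"a1_input": 0}))
```

With a real entry point, the required names would come from `ep.input.tensors`, as in the `invoke` function.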

## Reserve a GroqCard and Load IOP File
The following will introduce our Runtime API to reserve a GroqCard in your system, and load the program onto the reserved device.
By using `unsafe_keep_entry_points=True`, we instruct the runtime tools not to overwrite the existing set of entry points on the GroqChip when it does the load.
<b>Note:</b> The `unsafe_keep_entry_points` option is named as such because, unless your program package was compiled/scheduled/linked to be compatible with the current load set, the retained entry points may refer to stale information, which could hang the chip if invoked incorrectly.

In [None]:
iop_file = "./IOP/fibonacci_package.0.iop"
iop = runtime.IOProgram(iop_file)
programs = {program.name: program for program in iop}
device = runtime.devices[0]  # Assumes the first GroqCard in the system

# Open the device
device.open()

# Load each of the programs in the IOP file
device.load(programs[IopFileNames.INIT])
device.load(programs[IopFileNames.ITERATE], unsafe_keep_entry_points=True)
device.load(programs[IopFileNames.OFFLOAD], unsafe_keep_entry_points=True)

## Execute

We'll create some input data and then use the invoke function we defined earlier to execute each program that has been loaded on the GroqCard.

In [None]:
a1 = np.array(0, dtype=np.float16)
a2 = np.array(1, dtype=np.float16)
inputs = {"a1_input" : a1, "a2_input" : a2} #[a1, a2]

# invoke (device, program by name, entry point, inputs/outputs if there are any)
print("Invoking INIT")
invoke(device, programs[IopFileNames.INIT], EntryPoint.INPUT, tensors=inputs) 

print("Invoking ITERATE")
# Execute compute: No input to the program is needed as the data is already on the chip
invoke(device, programs[IopFileNames.ITERATE], EntryPoint.COMPUTE)  

print("Invoking OFFLOAD")
# Entry point 3 (OUTPUT) unloads the result from the GroqChip to the host
program_3_output = invoke(device, programs[IopFileNames.OFFLOAD], EntryPoint.OUTPUT)
print(program_3_output)

## Check Our Work
We'll define a Fibonacci function to check the results from the GroqCard.

In [None]:
## Define Fibonacci to check the results from the GroqCard

def fibonacci(n):
    a = 0
    b = 1

    for i in range(n):
        c = a + b
        a = b
        b = c
    return b

## Execution in a loop

Because all three programs remain loaded on the GroqCard, you can invoke any one of them repeatedly. For example, invoking the "iterate" program in a loop calculates further values in the Fibonacci sequence without returning to the host between steps:

In [None]:
# Define our inputs / starting values
a1 = np.array(0, dtype=np.float16)
a2 = np.array(1, dtype=np.float16)
inputs = {"a1_input" : a1, "a2_input" : a2} #[a1, a2]

# Set the number of iterations
loops = 6

# invoke (device, program by name,  entry point, inputs/outputs if there are any)
print("Invoking init")
invoke(device, programs[IopFileNames.INIT], EntryPoint.INPUT, tensors=inputs)

for x in range(loops):
    print("invoking iterate", x)
    invoke(device, programs[IopFileNames.ITERATE], EntryPoint.COMPUTE)

print("invoking offload")
program_3_output = invoke(device, programs[IopFileNames.OFFLOAD], EntryPoint.OUTPUT)
print("The result from Fibonacci is...")
print(program_3_output["result_out"])

if fibonacci(loops) != program_3_output["result_out"]:
    raise Exception("Results do not match")
else:
    print(fibonacci(loops))
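
One practical limit when raising `loops`: the tensors in this tutorial are `float16`, which cannot represent every integer above 2048. The following host-side sketch estimates where the integer reference check would first disagree with a half-precision computation. It assumes the on-chip add rounds like IEEE 754 half precision (round to nearest, ties to even), which this tutorial does not confirm; the helper names `to_fp16`, `fibonacci_fp16`, `fibonacci_exact`, and `first_mismatch` are hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fibonacci_fp16(n: int) -> float:
    """Same recurrence as the reference check, rounding each sum to float16."""
    a, b = 0.0, 1.0
    for _ in range(n):
        a, b = b, to_fp16(a + b)
    return b


def fibonacci_exact(n: int) -> int:
    """Integer reference, equivalent to the fibonacci() check function."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return b


def first_mismatch(limit: int = 23):
    """Smallest n below `limit` where the float16 result differs, else None."""
    for n in range(limit):
        if fibonacci_fp16(n) != fibonacci_exact(n):
            return n
    return None


print(first_mismatch())
```

Under that rounding assumption, the results agree through 17 iterations (2584) and first diverge at 18, where the exact value 4181 rounds to 4180. Keeping `loops` below that point, or switching to a wider data type, avoids a spurious "Results do not match" error.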

### GroqView (OPTIONAL)
GroqView can be used to view the instructions of your program in the GroqChip. When you click on an instruction, you can get the name of the Tensor API level operation. Note: it is expected that you are familiar with GroqView for the following section of this tutorial. See the GroqView User Guide for more details. 

Using the following command, we can create a .json file that can be used to view the program in hardware. This will show:
* What instructions occur
* Where on the chip they take place, as well as
* When in time (cycles) each instruction occurs.

To launch the GroqView tool, uncomment and run the following command. Remember, you still need to create a tunnel to the server running GroqView to load in another window.

<b>Note:</b> before proceeding to the next section, you'll want to stop this cell.

In [None]:
g.write_visualizer_data("fibonacci")
#!groqview fibonacci/visdata.json 

In the GroqView tool, you should see that the tensor is copied 2 bytes * 32-way SIMD per cycle over the course of 32 cycles.