
## Global Variable Settings

We first initialize multiprocessing tools, generate random activations and weights, and print these values.

In [17]:
from multiprocessing import Process, Queue, Array
import random
import time

num_activations = 64
num_weights = 4

# Create activations randomly.
activations = [random.uniform(0, 1) for _ in range(num_activations)]

# The kernel weights. In the output-stationary schemes below, every PE uses all of them in turn.
weights = [random.uniform(0, 1) for _ in range(num_weights)]
print(activations)
print(weights)

[0.9741516625743291, 0.5086190822911398, 0.3177596484238183, 0.8524613491353215, 0.6193476347747476, 0.7656951004356596, 0.8587422000745983, 0.09721144140558635, 0.17787006277453477, 0.15212671371418773, 0.28241154490528175, 0.5214746700971236, 0.7500002024776957, 0.017848327027516375, 0.8155798921650892, 0.5717607692015005, 0.7266148324424019, 0.6886893544072334, 0.3087998201352089, 0.04927760744620879, 0.8810986252156062, 0.05295281496023474, 0.6507850945912833, 0.11809216087756302, 0.10224998965926901, 0.2669552014408276, 0.7070298616090833, 0.7198575419857357, 0.7980758597871257, 0.9125933494164008, 0.5082185443840702, 0.19437851162805286, 0.7785489133011408, 0.875713882607681, 0.5750779318790126, 0.031374795731412886, 0.5518112226980404, 0.3357508071013099, 0.009132224843999448, 0.8035964196827341, 0.4476354058618275, 0.11536830516971652, 0.5898339244799754, 0.9249963803629728, 0.958374255533157, 0.5996624666180775, 0.39587196537350744, 0.4300080887670935, 0.34435715252895893, 0.7

## Convolution Results

We compute the 1-D convolution directly to obtain reference results for validation. The output of each simulated data flow below must match these values exactly.
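For reference, the computation in the cell below can be written out as follows. The symbols are shorthand introduced here, not names from the notebook: $a$ for `activations`, $w$ for `weights`, $y$ for `ground_truth`, $N$ for `num_activations`, and $K$ for `num_weights`.

$$
y_i = \sum_{k=0}^{K-1} w_k \, a_{i+k}, \qquad i = 0, 1, \dots, N-K
$$

With $N = 64$ and $K = 4$ this gives $N - K + 1 = 61$ outputs, which is also the number of PEs used in the simulations. The kernel is not flipped, so strictly speaking this is a cross-correlation.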

In [18]:
ground_truth = [0 for _ in range(num_activations-num_weights+1)]
for i in range(num_activations-num_weights+1):
    for k in range(num_weights):
        ground_truth[i] += weights[k]*activations[i+k]
print(ground_truth)

[1.4299752639541456, 1.0829551301902267, 1.0896032634315613, 1.5203730347910593, 0.9930601543482265, 0.9203737526044068, 0.7993152749675463, 0.3260653029265934, 0.5547449693055144, 0.7763904644439119, 0.6011658042395506, 1.0426058506267435, 1.142109345849558, 0.8222868958099587, 1.3728958121531025, 1.0102305475564564, 0.8442690387116951, 1.0955485502969675, 0.5582485733975552, 0.6501588241307801, 0.9401125085895283, 0.30685276403751394, 0.6915567708157903, 0.5962734609351557, 0.7797146971024926, 1.0649048689828575, 1.4809234788258885, 1.3226744504509993, 1.102410236936651, 1.3029285438149278, 1.1690093481202453, 0.9570897664359912, 1.0091872612697106, 1.1134103357978664, 0.7989383435791139, 0.28420374946065546, 0.945749722917869, 0.7624891496988013, 0.4305229776726728, 1.0755621129391228, 1.0692831858976009, 1.0794858896395214, 1.3269161333446544, 1.3502335963603151, 1.2320834105596068, 0.8782494636020226, 0.9087665164854444, 1.0491089638403286, 1.1781009663199202, 1.2618671959823535, 

## Validation Function
The `mse_error` function computes the Mean Square Error (MSE) between two sequences.

In [19]:
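def mse_error(ground_truth, output):
    # Sum the squared differences, then divide by the number of elements
    # so that the reported value is a mean rather than a total.
    error = 0.0
    for gt, out in zip(ground_truth, output):
        error += (gt - out) ** 2
    error /= len(ground_truth)
    print(f"The expected results : {ground_truth}")
    print(f"The simulated results: {output[:]}")
    print(f"The Mean Square Error: {error:.4f}")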
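A small worked example may help make the metric concrete. This is only a sketch: `mean_squared_error` is a hypothetical stand-alone helper (not part of the notebook) that returns the value instead of printing it, and the input lists are made up for illustration.

```python
def mean_squared_error(expected, actual):
    """Return the mean of the squared element-wise differences."""
    if len(expected) != len(actual):
        raise ValueError("sequences must have the same length")
    return sum((e - a) ** 2 for e, a in zip(expected, actual)) / len(expected)

# Two of the three entries differ by 0.5, so the MSE is (0.25 + 0.25) / 3.
example_mse = mean_squared_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.5])
print(f"{example_mse:.4f}")
```

A correct simulation should report an MSE of zero, since every PE accumulates its products in the same order as the reference loop.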

## Output stationary data flow

We implement three data flow variants that all keep the outputs stationary: each PE accumulates one output element in its own register.

### OS type 1

The weights are broadcast to all PEs, while the activations are passed through the PEs sequentially.

Note that a PE only retrieves a weight once an activation it needs has arrived. Until then, the broadcast weights wait in the buffer of its queue.

Each OS_worker gets an activation from the `activate_queue` on its "left". If the activation falls within the PE's multiplication range, the worker gets a weight from `weight_queue`, multiplies the two, and accumulates the product into the `partial_output_register`. In either case, it then passes the activation on to the next PE through `output_queue`.
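The schedule described above can be sketched in a single process, without queues, to show which PE is active at which time step. This is an illustrative model under simplifying assumptions rather than the notebook's implementation: `os_type1_sketch` and the `demo_*` names are hypothetical, and each PE's weight buffer is modeled as a simple read counter.

```python
import random

def os_type1_sketch(activations, weights):
    """Single-process model of OS type 1: activations stream through the PEs
    while each PE reads the broadcast weights from its own buffer in order."""
    num_weights = len(weights)
    num_pes = len(activations) - num_weights + 1
    outputs = [0.0] * num_pes
    # Number of weights each PE has consumed from its private buffer so far.
    weights_used = [0] * num_pes

    for time_step, activation in enumerate(activations):
        # The activation visits PE 0, PE 1, ... in order.
        for pe in range(num_pes):
            # PE `pe` only multiplies during its window of num_weights steps.
            if pe <= time_step <= pe + num_weights - 1:
                weight = weights[weights_used[pe]]
                weights_used[pe] += 1
                outputs[pe] += weight * activation
    return outputs

random.seed(0)
demo_activations = [random.uniform(0, 1) for _ in range(8)]
demo_weights = [random.uniform(0, 1) for _ in range(3)]
# Direct convolution for comparison.
reference = [sum(demo_weights[k] * demo_activations[i + k] for k in range(3))
             for i in range(6)]
sketch = os_type1_sketch(demo_activations, demo_weights)
print(all(abs(r - s) < 1e-12 for r, s in zip(reference, sketch)))
```

The window test mirrors the `time_step` check in the worker: PE `pe` is responsible for output `pe`, which depends only on activations `pe` through `pe + num_weights - 1`.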

In [20]:
def OS_worker1(id, partial_output_register, activate_queue, weight_queue, output_queue):
    ## The PE function for the Output Stationary Scheme 1

    while True:
        # Get input activation
        data = activate_queue.get()

        # Check if the activation is a termination signal
        if data is None:
            output_queue.put(data)
            break

        activation, time_step = data
        # Each PE takes input within certain range of time step
        if time_step<id or time_step>id+num_weights-1:
            output_queue.put(data)
            continue

        # Get kernel weights
        weight = weight_queue.get()

        # Update the RegFile in the PE
        partial_output_register[id] += weight*activation

        # Pass the input data into the output_queue
        output_queue.put(data)

    print(f"Worker ID {id}: is Done!")

In [21]:
num_pes = num_activations - num_weights + 1

activate_queues = [Queue() for _ in range(num_pes+1)]
weight_queues = [Queue() for _ in range(num_pes)]

processes = []

# Create a global variable for the PE RegFiles
PE_RegFiles = Array('d', [0.0 for _ in range(num_pes)])

start_time = time.time()

# Create and start a process for each PE.
for i in range(num_pes):
    p = Process(target=OS_worker1, args=(
        i, PE_RegFiles, activate_queues[i], weight_queues[i], activate_queues[i+1]))
    processes.append(p)
    p.start()

# Broadcasting the weights to all PEs.
# The weights stay buffered in the queues until each PE fetches them for a valid activation
for i in range(num_weights):
    for w in weight_queues:
        w.put(weights[i])

print("Done broadasting Weights!")

# Pass the activations into the leftmost PE, each tagged with its time step
for i, activation in enumerate(activations):
    activate_queues[0].put((activation, i))

# Pass the termination signal at the end of the input sequence
activate_queues[0].put(None)
print("Done putting None!")

# Make sure to join the PE processes to clean up properly
for i, p in enumerate(processes):
    p.join()

end_time = time.time()

OS_output = PE_RegFiles

print("All Done!")
mse_error(ground_truth, OS_output)

runtime = end_time - start_time
print("Runtime:", runtime, "seconds")

Worker ID 0: is Done!Worker ID 1: is Done!

Worker ID 2: is Done!
Worker ID 3: is Done!Worker ID 4: is Done!
Worker ID 5: is Done!
Worker ID 6: is Done!
Worker ID 7: is Done!


Worker ID 8: is Done!Worker ID 9: is Done!Worker ID 10: is Done!
Worker ID 11: is Done!Worker ID 12: is Done!
Worker ID 13: is Done!Done broadasting Weights!
Done putting None!
Worker ID 14: is Done!



Worker ID 15: is Done!Worker ID 16: is Done!Worker ID 17: is Done!


Worker ID 18: is Done!Worker ID 19: is Done!
Worker ID 20: is Done!

Worker ID 21: is Done!Worker ID 24: is Done!Worker ID 23: is Done!Worker ID 22: is Done!Worker ID 25: is Done!
Worker ID 26: is Done!



Worker ID 27: is Done!
Worker ID 28: is Done!
Worker ID 29: is Done!
Worker ID 30: is Done!

Worker ID 31: is Done!
Worker ID 32: is Done!
Worker ID 33: is Done!Worker ID 34: is Done!Worker ID 35: is Done!


Worker ID 36: is Done!Worker ID 37: is Done!

Worker ID 38: is Done!
Worker ID 39: is Done!Worker ID 40: is Done!
Worker ID 41: is Done!


### OS type 2

The activations are broadcast to all PEs, while the weights are passed through the PEs sequentially.

Each OS_worker waits for weights to arrive from the `weight_queue` on its left. Until the first weight can arrive, it simply gets each broadcast activation from `activate_queue` and discards it. Once a weight arrives, the worker multiplies it by the current activation and accumulates the product into the `partial_output_register`. Then, it passes the weight on to the next PE through `output_queue`.
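To make the weight movement concrete, here is a single-process sketch in which the weights shift one PE to the right per time step while every PE sees the same activation. It is a simplified model, not the notebook's code: `os_type2_sketch`, `weight_at_pe`, and the demo inputs are hypothetical names and values chosen for illustration.

```python
def os_type2_sketch(activations, weights):
    """Single-process model of OS type 2: each activation is broadcast to all
    PEs while the weights shift one PE to the right per time step."""
    num_weights = len(weights)
    num_pes = len(activations) - num_weights + 1
    outputs = [0.0] * num_pes
    # weight_at_pe[i] is the weight currently held by PE i (None = no weight).
    weight_at_pe = [None] * num_pes

    for time_step, activation in enumerate(activations):
        # Shift the weight pipeline right and feed the next weight into PE 0.
        incoming = weights[time_step] if time_step < num_weights else None
        weight_at_pe = [incoming] + weight_at_pe[:-1]
        # Every PE sees the broadcast activation; only PEs holding a weight use it.
        for pe, weight in enumerate(weight_at_pe):
            if weight is not None:
                outputs[pe] += weight * activation
    return outputs

demo_outputs = os_type2_sketch([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 0.5])
print(demo_outputs)  # [2.0, 3.5, 5.0, 6.5]
```

PE `i` receives its first weight at time step `i`, which is why the worker discards activations whose `time_step` is smaller than its `id`.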

In [8]:
def OS_worker2(id, partial_output_register, activate_queue, weight_queue, output_queue):
    ## The PE function for the Output Stationary Scheme 2

    while True:

        # Get input activation
        activation, time_step = activate_queue.get()

        # Skip activations that arrive before any weight can reach this PE
        if time_step < id:
            continue

        # Get kernel weight
        weight = weight_queue.get()

        # Check if the weight is a termination signal
        if weight is None:
            output_queue.put(weight)
            break

        # Update the RegFile in the PE
        partial_output_register[id] += weight*activation

        # Pass the weight on to the next PE through the output_queue
        output_queue.put(weight)

    print(f"Worker ID {id}: is Done!")

In [22]:
num_pes = num_activations - num_weights + 1

activate_queues = [Queue() for _ in range(num_pes)]
weight_queues = [Queue() for _ in range(num_pes+1)]

# List of processes: [PE[0]...PE[num_pes-1]]
processes = []

# Create a global variable for the PE RegFiles
PE_RegFiles = Array('d', [0.0 for _ in range(num_pes)])

start_time = time.time()

# Create and start a process for each PE.
for i in range(num_pes):
    p = Process(target=OS_worker2, args=(
        i, PE_RegFiles, activate_queues[i], weight_queues[i], weight_queues[i+1]))
    processes.append(p)
    p.start()

# Broadcasting the activations to all PEs.
for i,activation in enumerate(activations):
    for a in activate_queues:
        a.put((activation,i))
# Broadcast one extra dummy activation so that the last PE can reach the termination signal
for a in activate_queues:
    a.put((0,num_pes))


print("Done broadasting Activations!")

# Pass the weights into the leftmost PE
for weight in weights:
    weight_queues[0].put(weight)

# Pass the termination signal at the end of the weight sequence
weight_queues[0].put(None)
print("Done passing None!")

# Make sure to join the PE processes to clean up properly
for i, p in enumerate(processes):
    p.join()

end_time = time.time()

OS_output = PE_RegFiles

print("All Done!")
mse_error(ground_truth, OS_output)

runtime = end_time - start_time
print("Runtime:", runtime, "seconds")

Worker ID 0: is Done!
Worker ID 3: is Done!Worker ID 2: is Done!Worker ID 1: is Done!Worker ID 4: is Done!
Worker ID 5: is Done!
Worker ID 8: is Done!Worker ID 7: is Done!Worker ID 9: is Done!



Worker ID 6: is Done!


Worker ID 10: is Done!Done broadasting Activations!
Done passing None!

Worker ID 12: is Done!Worker ID 11: is Done!

Worker ID 13: is Done!Worker ID 15: is Done!Worker ID 14: is Done!


Worker ID 16: is Done!Worker ID 17: is Done!

Worker ID 18: is Done!Worker ID 19: is Done!Worker ID 20: is Done!


Worker ID 21: is Done!
Worker ID 22: is Done!
Worker ID 23: is Done!
Worker ID 24: is Done!
Worker ID 25: is Done!
Worker ID 26: is Done!Worker ID 27: is Done!

Worker ID 28: is Done!Worker ID 30: is Done!
Worker ID 29: is Done!

Worker ID 32: is Done!Worker ID 31: is Done!Worker ID 33: is Done!

Worker ID 34: is Done!
Worker ID 36: is Done!Worker ID 35: is Done!


Worker ID 37: is Done!Worker ID 38: is Done!

Worker ID 39: is Done!
Worker ID 40: is Done!Worker ID 41: is Do

### OS type 3

The weights are broadcast to all PEs, while the activations are passed through the PEs sequentially.

Note that no PE starts retrieving the broadcast weights until a certain `time_step` (when the first activation arrives at the last PE). Furthermore, the PE IDs in this approach are assigned in reverse order.

Each OS_worker gets an activation from the `activate_queue` on its "left" and keeps passing it on to the next PE through the `output_queue`. Once the first activation has reached the last PE, the workers start retrieving the broadcast weights from `weight_queue` and accumulating the products into the `partial_output_register`.
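One way to see why the reversed IDs let all PEs consume the same weight at the same moment is a cycle-level sketch. It rests on an assumption that is not stated in the notebook: an activation advances exactly one PE per cycle, so activation `j` sits at physical position `p` during cycle `j + p`. The names `os_type3_sketch`, `position`, and `first_cycle` are hypothetical.

```python
def os_type3_sketch(activations, weights):
    """Cycle-level model of OS type 3, assuming one PE hop per cycle."""
    num_weights = len(weights)
    num_pes = len(activations) - num_weights + 1
    outputs = [0.0] * num_pes
    # Cycle in which activation 0 reaches the last physical position.
    first_cycle = num_pes - 1

    for k, weight in enumerate(weights):
        cycle = first_cycle + k
        # The same broadcast weight is used by every PE during this cycle.
        for position in range(num_pes):
            pe_id = num_pes - 1 - position        # IDs run in reverse order
            activation_index = cycle - position   # activation at this position now
            outputs[pe_id] += weight * activations[activation_index]
    return outputs

demo_outputs = os_type3_sketch([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 0.5])
print(demo_outputs)  # [2.0, 3.5, 5.0, 6.5]
```

Under this timing assumption, `activation_index` always equals `pe_id + k`, so a single broadcast of weight `k` serves every output at once and only `num_weights` broadcast cycles are needed after the pipeline has filled.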

In [10]:
def OS_worker3(id, partial_output_register, activate_queue, weight_queue, output_queue):
    ## The PE function for the Output Stationary Scheme 3

    while True:
        # Get input activation
        data = activate_queue.get()

        # Activations never carry the termination signal in this scheme;
        # it arrives through the weight_queue instead.
        activation, time_step = data
        # Each PE takes input within certain range of time step
        if time_step<id:
            output_queue.put(data)
            continue

        # Get kernel weights
        weight = weight_queue.get()

        # A None weight is the termination signal
        if weight is None:
            break

        # Update the RegFile in the PE
        partial_output_register[id] += weight*activation

        # Pass the input data into the output_queue
        output_queue.put(data)

    print(f"Worker ID {id}: is Done!")

In [23]:
num_pes = num_activations - num_weights + 1

activate_queues = [Queue() for _ in range(num_pes+1)]
weight_queues = [Queue() for _ in range(num_pes)]

processes = []

# Create a global variable for the PE RegFiles
PE_RegFiles = Array('d', [0.0 for _ in range(num_pes)])

start_time = time.time()

# Create and start a process for each PE.
for i in range(num_pes):
    p = Process(target=OS_worker3, args=(
        num_pes-1-i, PE_RegFiles, activate_queues[i], weight_queues[i], activate_queues[i+1]))
    processes.append(p)
    p.start()

# Broadcasting the weights to all PEs.
# The weights stay buffered in the queues until each PE fetches them for a valid activation
for i in range(num_weights):
    for w in weight_queues:
        w.put(weights[i])

print("Done broadasting Weights!")

# Pass the activation data into the leftmost PE
for i, activation in enumerate(activations):
    activate_queues[0].put((activation, i))
# Pass one extra dummy activation so that the leftmost PE (the highest ID)
# makes one more weight read and receives the termination signal
activate_queues[0].put((0, num_pes))

# Broadcast the termination signal on every weight queue
for i in range(num_weights):
    for w in weight_queues:
        w.put(None)
print("Done broadcasting None!")

# Make sure to join the PE processes to clean up properly
for i, p in enumerate(processes):
    p.join()

end_time = time.time()

OS_output = PE_RegFiles

print("All Done!")
mse_error(ground_truth, OS_output)

runtime = end_time - start_time
print("Runtime:", runtime, "seconds")

Worker ID 60: is Done!Worker ID 59: is Done!Worker ID 58: is Done!
Worker ID 55: is Done!Worker ID 54: is Done!

Worker ID 52: is Done!Worker ID 53: is Done!Worker ID 57: is Done!Worker ID 56: is Done!



Worker ID 51: is Done!


Done broadasting Weights!
Done broadcasting None!
Worker ID 49: is Done!
Worker ID 50: is Done!
Worker ID 48: is Done!Worker ID 47: is Done!

Worker ID 45: is Done!
Worker ID 46: is Done!
Worker ID 43: is Done!Worker ID 44: is Done!Worker ID 42: is Done!


Worker ID 40: is Done!Worker ID 41: is Done!Worker ID 39: is Done!

Worker ID 38: is Done!
Worker ID 37: is Done!
Worker ID 36: is Done!

Worker ID 35: is Done!Worker ID 34: is Done!
Worker ID 32: is Done!Worker ID 33: is Done!

Worker ID 30: is Done!Worker ID 31: is Done!


Worker ID 29: is Done!
Worker ID 28: is Done!Worker ID 27: is Done!


Worker ID 26: is Done!Worker ID 25: is Done!
Worker ID 24: is Done!
Worker ID 23: is Done!
Worker ID 21: is Done!Worker ID 22: is Done!

Worker ID 20: is Done!Worker I