# Lab 4

In [None]:
from loaders import *
import pandas as pd
import seaborn as sns
import numpy as np
import scipy.sparse as ss
import ruamel.yaml as yaml

## Part 6: Designing a Sparsity-Aware Matrix Multiplication System

For this section, you are given a template sparse matrix multiplication accelerator. Your job is to find the best dataflow for 5 sparse matrix multiplication workloads, given three different compression formats. You will be optimizing to minimize energy and cycles.


We start by showing you how to run the simulator with an example mapping: 

In [None]:
example_config = dict(
    DRAM_factor_K=4096,
    DRAM_factor_M=256,
    DRAM_factor_N=256,
    DRAM_permutation=['K', 'M', 'N'],
    Buffer_factor_K=1,
    Buffer_factor_M=4,
    Buffer_factor_N=16,
    Buffer_permutation=['K', 'M', 'N'],
    PE_factor_K=1,
    PE_factor_M=4,
    PE_factor_N=1,
    PE_permutation=['K', 'M', 'N'],
    sparse_format='COO' # B, COO, CSR
) # CHANGE ME!

# DO NOT CHANGE THIS 
if example_config['sparse_format'] == 'B':
    sparse_opt = 'part6/sparse-opt_B.yaml'
    example_config['metadata_datawidth']=1
elif example_config['sparse_format'] == 'COO':
    sparse_opt = 'part6/sparse-opt_COO.yaml'
    example_config['metadata_datawidth']=16 
elif example_config['sparse_format'] == 'CSR':
    sparse_opt = 'part6/sparse-opt_CSR.yaml'
    example_config['metadata_datawidth']=16 

run_timeloop_model(
    example_config,
    alt_top='part6/top.yaml.jinja2',
    constraints='part6/constraints_global.yaml',
    problem='part6/problem3.yaml',
    mapping='part6/example_map.yaml',
    sparse_optimizations=sparse_opt
)
with open('./output_dir/timeloop-model.stats.txt', 'r') as stats_file:
    stats = stats_file.read()
print(stats)
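
When editing the factors, it can help to check what size each dimension ends up with. The sketch below assumes, as is usual for Timeloop mappings, that the DRAM, Buffer, and PE factors of a dimension multiply to the full problem dimension. `dimension_products` is a hypothetical helper for illustration and is not part of `loaders`.

```python
from math import prod

LEVELS = ("DRAM", "Buffer", "PE")
DIMS = ("M", "N", "K")


def dimension_products(config):
    """Multiply the per-level factors of each dimension (hypothetical helper)."""
    return {
        dim: prod(config[f"{level}_factor_{dim}"] for level in LEVELS)
        for dim in DIMS
    }


# Factors copied from the example mapping above.
example_factors = dict(
    DRAM_factor_K=4096, DRAM_factor_M=256, DRAM_factor_N=256,
    Buffer_factor_K=1, Buffer_factor_M=4, Buffer_factor_N=16,
    PE_factor_K=1, PE_factor_M=4, PE_factor_N=1,
)

products = dimension_products(example_factors)
print(products)
```

If a product does not match the dimension listed in the problem file, the mapping is probably not the one you intended.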

You are given the following architecture:

In [None]:
show_config('part6/arch.yaml')

You are given the following compression formats:
1. Bitmask (part6/sparse-opt_B.yaml)
2. Compressed Sparse Row (part6/sparse-opt_CSR.yaml)
3. Coordinate List (part6/sparse-opt_COO.yaml)
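
To build intuition for how the three formats trade off, the sketch below estimates metadata size using the textbook definition of each format. It assumes 1 bit per position for the bitmask and 16-bit coordinates for COO and CSR, matching the `metadata_datawidth` values set in the notebook. The `sparse-opt_*.yaml` files may model metadata differently, so treat these numbers as rough guides only; `metadata_bits` is a hypothetical helper.

```python
def metadata_bits(rows, cols, nnz, fmt, coord_bits=16):
    """Rough metadata size in bits for one rows x cols tile with nnz nonzeros."""
    if fmt == "B":
        # Bitmask: one presence bit per position, independent of nnz.
        return rows * cols
    if fmt == "COO":
        # Coordinate list: one (row, column) pair per nonzero.
        return 2 * nnz * coord_bits
    if fmt == "CSR":
        # Compressed sparse row: one column index per nonzero
        # plus rows + 1 row pointers.
        return (nnz + rows + 1) * coord_bits
    raise ValueError(f"unknown format: {fmt}")


# Compare a very sparse and a half-dense 64 x 64 tile.
very_sparse = {fmt: metadata_bits(64, 64, 41, fmt) for fmt in ("B", "COO", "CSR")}
half_dense = {fmt: metadata_bits(64, 64, 2048, fmt) for fmt in ("B", "COO", "CSR")}
print(very_sparse)
print(half_dense)
```

Under these assumptions, coordinate-based formats carry less metadata only when the tile is very sparse, while the bitmask cost stays fixed regardless of occupancy.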

In [None]:
show_config('part6/sparse-opt_B.yaml')

In [None]:
show_config('part6/sparse-opt_CSR.yaml')

In [None]:
show_config('part6/sparse-opt_COO.yaml')

You are tasked with optimizing the following 5 workloads:

In [None]:
show_config('part6/problem1.yaml')

In [None]:
show_config('part6/problem2.yaml')

In [None]:
show_config('part6/problem3.yaml')

In [None]:
show_config('part6/problem4.yaml')

In [None]:
show_config('part6/problem5.yaml')

Please note that although the problems are listed as "structured" for Timeloop simulation purposes, you should not assume that tiles are guaranteed to have the same occupancy. 
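
The point about occupancy can be illustrated with a small experiment. The sketch below is an illustration only and is not how Timeloop models sparsity: it scatters nonzeros uniformly at random over a matrix and counts how many land in each square tile. The function name, matrix size, tile size, and density are all hypothetical, and the matrix dimensions are assumed to be divisible by the tile size.

```python
import random


def tile_occupancies(rows, cols, tile, density, seed=0):
    """Count nonzeros per tile for a randomly populated rows x cols matrix."""
    rng = random.Random(seed)
    counts = {}
    for r in range(rows):
        for c in range(cols):
            if rng.random() < density:
                key = (r // tile, c // tile)
                counts[key] = counts.get(key, 0) + 1
    # Return one count per tile, including tiles that stayed empty.
    return [
        counts.get((i, j), 0)
        for i in range(rows // tile)
        for j in range(cols // tile)
    ]


occupancies = tile_occupancies(rows=64, cols=64, tile=8, density=0.1)
print("min:", min(occupancies))
print("max:", max(occupancies))
print("mean:", sum(occupancies) / len(occupancies))
```

The spread between the minimum and maximum is what a buffer sized for the average tile has to cope with.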

## Part 6.1: Workload-specific Compression Format and Dataflow Optimization

You may use different mappings, different dataflows, and different compression formats for different workloads. Report the dataflow and compression format you used and the average energy/latency you achieve across each of the 5 workloads. Your goal is to meet the described performance constraint for each of the 5 workloads. 

The performance constraints are described in terms of EDP (J*cycles): 


| Problem | Half Marks Threshold | Full Marks Threshold |
|---------|----------------------|----------------------|
| 1       | 2.0e+11              | 9.1e+10              |
| 2       | 1.5e+10              | 8.4e+9               |
| 3       | 1.5e+6               | 4.9e+5               |
| 4       | 1.0e+10              | 5.5e+9               |
| 5       | 1.2e+8               | 8.5e+7               |

### Performance Grading
+1 point is given for meeting the half marks threshold and +1 point is given for meeting the full marks threshold. Up to +2 points will be assigned for each workload where you meet the performance constraint, up to a maximum of +10 points.
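
The scoring rule can be sketched as follows, using the Part 6.1 thresholds from the table. The EDP values are made up for illustration, and `edp_joule_cycles` assumes the simulator reports energy in microjoules, which the division by `1e6` in the grading cell suggests but which is not confirmed here. Both function names are hypothetical.

```python
HALF_MARKS = [2.0e11, 1.5e10, 1.5e6, 1.0e10, 1.2e8]
FULL_MARKS = [9.1e10, 8.4e9, 4.9e5, 5.5e9, 8.5e7]


def edp_joule_cycles(cycles, energy_uj):
    """Energy-delay product in J * cycles, assuming energy is given in uJ."""
    return cycles * energy_uj / 1e6


def points(edp, half_threshold, full_threshold):
    """+1 for beating the half marks threshold, +1 more for the full marks one."""
    return int(edp < half_threshold) + int(edp < full_threshold)


# Hypothetical EDPs for problems 1 to 5, not real simulation results.
hypothetical_edps = [1.0e11, 1.0e10, 4.0e5, 1.2e10, 8.0e7]
scores = [
    points(edp, half, full)
    for edp, half, full in zip(hypothetical_edps, HALF_MARKS, FULL_MARKS)
]
total = sum(scores)
print(scores, total)
```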

In [145]:
# Set your configs here
problem1_config = dict(
    DRAM_factor_K=4096,
    DRAM_factor_M=256,
    DRAM_factor_N=256,
    DRAM_permutation=['K', 'M', 'N'],
    Buffer_factor_K=1,
    Buffer_factor_M=8,
    Buffer_factor_N=8,
    Buffer_permutation=['K', 'M', 'N'],
    PE_factor_K=1,
    PE_factor_M=2,
    PE_factor_N=2,
    PE_permutation=['K', 'M', 'N'],
    sparse_format='CSR'# B, COO, CSR
)  # CHANGE ME!


problem2_config = dict(
    DRAM_factor_K=4096,
    DRAM_factor_M=256,
    DRAM_factor_N=256,
    DRAM_permutation=['K', 'N', 'M'],
    Buffer_factor_K=1,
    Buffer_factor_M=8,
    Buffer_factor_N=8,
    Buffer_permutation=['K', 'N', 'M'],
    PE_factor_K=1,
    PE_factor_M=2,
    PE_factor_N=2,
    PE_permutation=['K', 'N', 'M'],
    sparse_format='CSR' # B, COO, CSR
) # CHANGE ME!

problem3_config = dict(
    DRAM_factor_K=4096,
    DRAM_factor_M=256,
    DRAM_factor_N=256,
    DRAM_permutation=['K','M','N'],
    Buffer_factor_K=1,
    Buffer_factor_M=4,
    Buffer_factor_N=16,
    Buffer_permutation=['K','M','N'],
    PE_factor_K=1,
    PE_factor_M=4,
    PE_factor_N=1,
    PE_permutation=['K','M','N'],
    sparse_format='COO' # B, COO, CSR
)  # CHANGE ME!

problem4_config = dict(
    DRAM_factor_K=4096,
    DRAM_factor_M=256,
    DRAM_factor_N=256,
    DRAM_permutation=['K','N','M'],
    Buffer_factor_K=1,
    Buffer_factor_M=4,
    Buffer_factor_N=16,
    Buffer_permutation=['K','N','M'],
    PE_factor_K=1,
    PE_factor_M=4,
    PE_factor_N=1,
    PE_permutation=['K','N','M'],
    sparse_format='CSR' # B, COO, CSR
)  # CHANGE ME!

problem5_config = dict(
    DRAM_factor_K=4096,
    DRAM_factor_M=1024,
    DRAM_factor_N=16,
    DRAM_permutation=['K','M','N'],
    Buffer_factor_K=4,
    Buffer_factor_M=16,
    Buffer_factor_N=1,
    Buffer_permutation=['K','M','N'],
    PE_factor_K=1,
    PE_factor_M=1,
    PE_factor_N=4,
    PE_permutation=['K','M','N'],
    sparse_format='B' # B, COO, CSR
)  # CHANGE ME!


### Question 6.1
Run the autograding script, which checks your results against our set thresholds.

In [None]:
# DO NOT CHANGE THIS BOX
latencies = [np.inf, np.inf, np.inf, np.inf, np.inf]
energies = [np.inf, np.inf, np.inf, np.inf, np.inf]

for i, (config, problem) in enumerate(zip([problem1_config, problem2_config, problem3_config, problem4_config, problem5_config], ['part6/problem1.yaml', 'part6/problem2.yaml', 'part6/problem3.yaml', 'part6/problem4.yaml', 'part6/problem5.yaml'])):
    if config['sparse_format'] == 'B':
        sparse_opt = 'part6/sparse-opt_B.yaml'
        config['metadata_datawidth'] = 1
    elif config['sparse_format'] == 'COO':
        sparse_opt = 'part6/sparse-opt_COO.yaml'
        config['metadata_datawidth'] = 16
    elif config['sparse_format'] == 'CSR':
        sparse_opt = 'part6/sparse-opt_CSR.yaml'
        config['metadata_datawidth'] = 16
        
    
    result = run_timeloop_model(
        config,
        alt_top='part6/top.yaml.jinja2', 
        constraints='part6/constraints_global.yaml', 
        problem=problem, 
        mapping='part6/example_map.yaml', 
        sparse_optimizations=sparse_opt 
    )

    stats = open('./output_dir/timeloop-model.stats.txt', 'r').read()
    latency = result.cycles
    energy = result.energy 
    latencies[i] = latency
    energies[i] = energy
    
    print(latencies, energies)

edp = [(latency*energy) / 1e6 for latency, energy in zip(latencies, energies)]
assert(len(edp) == 5)

print(edp)

half_marks_thresholds = [2.0e+11, 1.5e+10, 1.5e+6, 1.0e+10, 1.2e+8]
full_marks_thresholds = [9.1e+10, 8.4e+9, 4.9e+5, 5.5e+9, 8.5e+7]

answer(
    question='6.1',
    subquestion=f'Problem 1. Minimum EDP achieved: {edp[0]:.2e} J * cycles. Passing half marks threshold?',
    answer=edp[0] < half_marks_thresholds[0],
    required_type=bool
)
answer(
    question='6.1',
    subquestion=f'Problem 1. Minimum EDP achieved: {edp[0]:.2e} J * cycles. Passing full marks threshold?',
    answer=edp[0] < full_marks_thresholds[0],
    required_type=bool
)
answer(
    question='6.1',
    subquestion=f'Problem 2. Minimum EDP achieved: {edp[1]:.2e} J * cycles. Passing half marks threshold?',
    answer=edp[1] < half_marks_thresholds[1],
    required_type=bool
)
answer(
    question='6.1',
    subquestion=f'Problem 2. Minimum EDP achieved: {edp[1]:.2e} J * cycles. Passing full marks threshold?',
    answer=edp[1] < full_marks_thresholds[1],
    required_type=bool
)
answer(
    question='6.1',
    subquestion=f'Problem 3. Minimum EDP achieved: {edp[2]:.2e} J * cycles. Passing half marks threshold?',
    answer=edp[2] < half_marks_thresholds[2],
    required_type=bool
)
answer(
    question='6.1',
    subquestion=f'Problem 3. Minimum EDP achieved: {edp[2]:.2e} J * cycles. Passing full marks threshold?',
    answer=edp[2] < full_marks_thresholds[2],
    required_type=bool
)
answer(
    question='6.1',
    subquestion=f'Problem 4. Minimum EDP achieved: {edp[3]:.2e} J * cycles. Passing half marks threshold?',
    answer=edp[3] < half_marks_thresholds[3],
    required_type=bool
)
answer(
    question='6.1',
    subquestion=f'Problem 4. Minimum EDP achieved: {edp[3]:.2e} J * cycles. Passing full marks threshold?',
    answer=edp[3] < full_marks_thresholds[3],
    required_type=bool
)
answer(
    question='6.1',
    subquestion=f'Problem 5. Minimum EDP achieved: {edp[4]:.2e} J * cycles. Passing half marks threshold?',
    answer=edp[4] < half_marks_thresholds[4],
    required_type=bool
)
answer(
    question='6.1',
    subquestion=f'Problem 5. Minimum EDP achieved: {edp[4]:.2e} J * cycles. Passing full marks threshold?',
    answer=edp[4] < full_marks_thresholds[4],
    required_type=bool
)

## Part 6.2: Choosing a Dataflow and Compression Format

Often, the dataflow and compression format in a real accelerator are decided at design time and cannot be changed for different workloads. You may use different mappings for different workloads, but you must use the same dataflow and compression format across all of them. Report the dataflow and compression format you used and the average energy/latency you achieve across the 5 workloads. Overall, what dataflow and compression format do you recommend? Why do you think it does so well for these workloads?
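
The grading cell for this part combines each workload's factors with the shared settings using dictionary unpacking, `{**config, **global_config}`. The sketch below shows the consequence with hypothetical variable names: keys from the shared dictionary override any per-workload key of the same name, so a stray `sparse_format` in a per-workload config has no effect.

```python
shared_settings = dict(
    DRAM_permutation=['K', 'M', 'N'],
    Buffer_permutation=['K', 'M', 'N'],
    PE_permutation=['K', 'M', 'N'],
    sparse_format='CSR',
)

workload_factors = dict(
    DRAM_factor_K=4096, DRAM_factor_M=256, DRAM_factor_N=256,
    Buffer_factor_K=1, Buffer_factor_M=8, Buffer_factor_N=8,
    PE_factor_K=1, PE_factor_M=2, PE_factor_N=2,
    sparse_format='COO',  # deliberately conflicting entry
)

# Later unpacked dictionaries win on duplicate keys.
merged = {**workload_factors, **shared_settings}
print(merged['sparse_format'])

# The original per-workload dictionary is left untouched.
print(workload_factors['sparse_format'])
```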

### Performance Grading
Your grade will be determined by the average energy and latency over the 5 workloads. 

The performance constraints are described in terms of EDP (J*cycles): 

| Problem | Half Marks Threshold | Full Marks Threshold |
|---------|----------------------|----------------------|
| 1       | 2.0e+11              | 1.3e+11              |
| 2       | 1.5e+10              | 1.5e+10              |
| 3       | 1.5e+6               | 1.1e+6               |
| 4       | 1.0e+10              | 6.0e+9               |
| 5       | 1.2e+8               | 9.0e+7               |


### Recommended Compression Format
- **Chosen Compression Format:**  
  _[Enter the chosen compression format]_  

### Recommended Dataflow
- **Chosen Dataflow:**

  _[Enter the chosen dataflow]_  

### Sparsity-Aware Matrix Multiplication System: Results
| Workload | Mapping Description | 
|----------|---------------------|
| 1        |  | 
| 2        |  | 
| 3        |  | 
| 4        |  | 
| 5        |  | 


### Justification for Recommendation
_[Explain why the chosen compression format and dataflow work best for the given workloads. Highlight factors like data shape, sparsity, data reuse, memory hierarchy, or any other observations.]_


In [None]:
# CHANGE ME: Set your configs here

global_config = dict(
    DRAM_permutation=['K', 'M', 'N'],
    Buffer_permutation=['K', 'M', 'N'],
    PE_permutation=['K','M','N'],
    sparse_format='CSR' # B, COO, CSR
) # CHANGE ME!

# DO NOT CHANGE THIS
if global_config['sparse_format'] == 'B':
    sparse_opt = 'part6/sparse-opt_B.yaml'
    global_config['metadata_datawidth'] = 1
elif global_config['sparse_format'] == 'COO':
    sparse_opt = 'part6/sparse-opt_COO.yaml'
    global_config['metadata_datawidth'] = 16
elif global_config['sparse_format'] == 'CSR':
    sparse_opt = 'part6/sparse-opt_CSR.yaml'
    global_config['metadata_datawidth'] = 16

shared_problem1_config = dict(
    DRAM_factor_K=4096,
    DRAM_factor_M=256,
    DRAM_factor_N=256,
    Buffer_factor_K=1,
    Buffer_factor_M=8,
    Buffer_factor_N=8,
    PE_factor_K=1,
    PE_factor_M=2,
    PE_factor_N=2,
) # CHANGE ME!

shared_problem2_config = dict(
    DRAM_factor_K=4096,
    DRAM_factor_M=256,
    DRAM_factor_N=256,
    Buffer_factor_K=1,
    Buffer_factor_M=8,
    Buffer_factor_N=8,
    PE_factor_K=1,
    PE_factor_M=2,
    PE_factor_N=2
) # CHANGE ME!

shared_problem3_config = dict(
    DRAM_factor_K=4096,
    DRAM_factor_M=256,
    DRAM_factor_N=256,
    Buffer_factor_K=1,
    Buffer_factor_M=4,
    Buffer_factor_N=16,
    PE_factor_K=1,
    PE_factor_M=4,
    PE_factor_N=1,
) # CHANGE ME!

shared_problem4_config = dict(
    DRAM_factor_K=4096,
    DRAM_factor_M=256,
    DRAM_factor_N=256,
    Buffer_factor_K=1,
    Buffer_factor_M=4,
    Buffer_factor_N=16,
    PE_factor_K=1,
    PE_factor_M=4,
    PE_factor_N=1
) # CHANGE ME!

shared_problem5_config = dict(
    DRAM_factor_K=4096,
    DRAM_factor_M=1024,
    DRAM_factor_N=16,
    Buffer_factor_K=4,
    Buffer_factor_M=16,
    Buffer_factor_N=1,
    PE_factor_K=1,
    PE_factor_M=1,
    PE_factor_N=4,
) # CHANGE ME!


### Question 6.2
Run the autograding script, which checks your results against our set thresholds.

In [None]:
# DO NOT CHANGE THIS BOX
latencies = [np.inf, np.inf, np.inf, np.inf, np.inf]
energies = [np.inf, np.inf, np.inf, np.inf, np.inf]

for i, (config, problem) in enumerate(zip([shared_problem1_config, shared_problem2_config, shared_problem3_config, shared_problem4_config, shared_problem5_config], ['part6/problem1.yaml', 'part6/problem2.yaml', 'part6/problem3.yaml', 'part6/problem4.yaml', 'part6/problem5.yaml'])):
    config = {
        **config,
        **global_config
    }

    if config['sparse_format'] == 'B':
        sparse_opt = 'part6/sparse-opt_B.yaml'
        config['metadata_datawidth'] = 1
    elif config['sparse_format'] == 'COO':
        sparse_opt = 'part6/sparse-opt_COO.yaml'
        config['metadata_datawidth'] = 16
    elif config['sparse_format'] == 'CSR':
        sparse_opt = 'part6/sparse-opt_CSR.yaml'
        config['metadata_datawidth'] = 16
        
    result = run_timeloop_model(
        config,
        alt_top='part6/top.yaml.jinja2', 
        constraints='part6/constraints_global.yaml',
        problem=problem, 
        mapping='part6/example_map.yaml', 
        sparse_optimizations=sparse_opt 
    )

    stats = open('./output_dir/timeloop-model.stats.txt', 'r').read()
    latency = result.cycles
    energy = result.energy 
    latencies[i] = latency
    energies[i] = energy
    
    print(latencies, energies)


shared_edp = [(latency*energy) / 1e6 for latency, energy in zip(latencies, energies)]
assert(len(shared_edp) == 5)

shared_half_marks_thresholds = [2.0e+11, 1.5e+10, 1.5e+6, 1.0e+10, 1.2e+8]
shared_full_marks_thresholds = [1.3e+11, 1.5e+10, 1.1e+6, 6e+9, 9e+7]
print(shared_edp)

answer(
    question='6.2',
    subquestion=f'Problem 1. Minimum EDP achieved: {shared_edp[0]:.2e} J * cycles. Passing half marks threshold?',
    answer=shared_edp[0] < shared_half_marks_thresholds[0],
    required_type=bool
)
answer(
    question='6.2',
    subquestion=f'Problem 1. Minimum EDP achieved: {shared_edp[0]:.2e} J * cycles. Passing full marks threshold?',
    answer=shared_edp[0] < shared_full_marks_thresholds[0],
    required_type=bool
)
answer(
    question='6.2',
    subquestion=f'Problem 2. Minimum EDP achieved: {shared_edp[1]:.2e} J * cycles. Passing half marks threshold?',
    answer=shared_edp[1] < shared_half_marks_thresholds[1],
    required_type=bool
)
answer(
    question='6.2',
    subquestion=f'Problem 2. Minimum EDP achieved: {shared_edp[1]:.2e} J * cycles. Passing full marks threshold?',
    answer=shared_edp[1] < shared_full_marks_thresholds[1],
    required_type=bool
)
answer(
    question='6.2',
    subquestion=f'Problem 3. Minimum EDP achieved: {shared_edp[2]:.2e} J * cycles. Passing half marks threshold?',
    answer=shared_edp[2] < shared_half_marks_thresholds[2],
    required_type=bool
)
answer(
    question='6.2',
    subquestion=f'Problem 3. Minimum EDP achieved: {shared_edp[2]:.2e} J * cycles. Passing full marks threshold?',
    answer=shared_edp[2] < shared_full_marks_thresholds[2],
    required_type=bool
)
answer(
    question='6.2',
    subquestion=f'Problem 4. Minimum EDP achieved: {shared_edp[3]:.2e} J * cycles. Passing half marks threshold?',
    answer=shared_edp[3] < shared_half_marks_thresholds[3],
    required_type=bool
)
answer(
    question='6.2',
    subquestion=f'Problem 4. Minimum EDP achieved: {shared_edp[3]:.2e} J * cycles. Passing full marks threshold?',
    answer=shared_edp[3] < shared_full_marks_thresholds[3],
    required_type=bool
)
answer(
    question='6.2',
    subquestion=f'Problem 5. Minimum EDP achieved: {shared_edp[4]:.2e} J * cycles. Passing half marks threshold?',
    answer=shared_edp[4] < shared_half_marks_thresholds[4],
    required_type=bool
)
answer(
    question='6.2',
    subquestion=f'Problem 5. Minimum EDP achieved: {shared_edp[4]:.2e} J * cycles. Passing full marks threshold?',
    answer=shared_edp[4] < shared_full_marks_thresholds[4],
    required_type=bool
)


A junior engineer has taken the initiative and proposed a new architecture for a sparse matrix multiplication accelerator and written a simulator for it of their own design. 


### Question 6.3

A junior engineer has proposed an architecture for a sparse matrix multiplication accelerator and simulated the energy and latency of a sparse matrix multiplication operation $Z_{m,n} = A_{m,k} \times B_{k,n}$ with 50% sparsity in operand $A$ and $M = 1024, N = 1024, K = 1024$. They claim that the results show that the energy and latency are exactly half those of performing the same operation on an equivalent dense matrix multiplication accelerator.

In [None]:
answer(
    question='6.3',
    subquestion='Is this claim realistic?',
    answer= False,
    required_type=bool
)
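
One way to reason about a claim like this is a back-of-the-envelope energy model. The sketch below is a toy model with hypothetical unit costs: every operand word is moved exactly once, and $A$ is compressed with a 1-bit-per-position bitmask. It is not the junior engineer's simulator and not Timeloop; it only shows that the costs for $B$, $Z$, and metadata do not shrink with the density of $A$.

```python
M = N = K = 1024

E_MAC = 1.0      # hypothetical energy per multiply-accumulate
E_WORD = 2.0     # hypothetical energy per operand word moved
WORD_BITS = 16   # hypothetical data word width


def total_energy(density_a, metadata_bits_per_position):
    """Toy energy estimate where only A's data and the MACs scale with density."""
    macs = M * N * K * density_a
    # A's payload shrinks with density, but its metadata does not.
    a_words = M * K * (density_a + metadata_bits_per_position / WORD_BITS)
    b_words = K * N  # B is dense
    z_words = M * N  # Z is written in full
    return macs * E_MAC + (a_words + b_words + z_words) * E_WORD


dense_energy = total_energy(density_a=1.0, metadata_bits_per_position=0)
sparse_energy = total_energy(density_a=0.5, metadata_bits_per_position=1)
ratio = sparse_energy / dense_energy
print(f"sparse / dense energy: {ratio:.4f}")
```

Even in this optimistic model the ratio sits slightly above one half, and any reuse-related traffic for $B$ or $Z$ would push it higher.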

### Question 6.4

The junior engineer has changed their architecture and simulator. They now claim that the results show that the energy has reduced but the latency has not changed compared to the dense matrix multiplication accelerator.


In [None]:
answer(
    question='6.4',
    subquestion='Is this claim realistic?',
    answer= True,
    required_type=bool
)
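
The difference between saving energy and saving cycles can be sketched with the two common ways hardware treats ineffectual operations: gating (the unit idles for that cycle) and skipping (the cycle is not spent at all). The model below is a toy with hypothetical costs that assumes one multiply-accumulate slot per cycle; `cycles_and_energy` is an illustrative name, not a simulator API.

```python
def cycles_and_energy(total_macs, density, mode, e_mac=1.0, e_idle=0.25):
    """Return (cycles, energy) for a single MAC unit under a toy cost model."""
    effectual = int(total_macs * density)
    ineffectual = total_macs - effectual
    if mode == "dense":
        # Every MAC is performed, zero operands included.
        return total_macs, total_macs * e_mac
    if mode == "gating":
        # Ineffectual MACs still occupy a cycle but only cost idle energy.
        return total_macs, effectual * e_mac + ineffectual * e_idle
    if mode == "skipping":
        # Ineffectual MACs are never scheduled.
        return effectual, effectual * e_mac
    raise ValueError(f"unknown mode: {mode}")


dense = cycles_and_energy(1000, 0.5, "dense")
gating = cycles_and_energy(1000, 0.5, "gating")
skipping = cycles_and_energy(1000, 0.5, "skipping")
print(dense, gating, skipping)
```

In this model gating lowers energy while leaving the cycle count unchanged, whereas skipping lowers both.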

### Question 6.5

The junior engineer has changed their architecture and simulator again. They now claim that the results show that the energy and latency are greater than those of a dense matrix multiplication accelerator.

In [None]:
answer(
    question='6.5',
    subquestion='Is this claim realistic?',
    answer= False,
    required_type=bool
)

### Question 6.6

The junior engineer has changed their architecture and simulator again. They now claim that the results show that the energy is greater but the latency is less than that of a dense matrix multiplication accelerator.

In [None]:
answer(
    question='6.6',
    subquestion='Is this claim realistic?',
    answer= True,
    required_type=bool
)