# Answers - OC Lab 2

## 2.1

```python
import sys
from typing import List
from tabulate import tabulate

table_rows: List[List[str | float]] = [["Array Size", "Avg Elapsed Time (s)", "Number of accesses", "Avg Access Time (ns)"]]

with open("lab2_kit/spark/spark.log") as f:
    next(f)  # ignore header

    # State of the block (one array size) currently being parsed
    array_size = None
    elapsed_times = []
    access_times = []

    def close_block():
        """Average the measurements of the current block and add them to the table."""
        global array_size, elapsed_times, access_times, num_accesses

        if array_size is None:
            return

        avg_elapsed_time = sum(elapsed_times) / len(elapsed_times)
        avg_access_time = sum(access_times) / len(access_times)

        table_rows.append([array_size, avg_elapsed_time, num_accesses, avg_access_time])

        array_size = None
        elapsed_times = []
        access_times = []

    for line in f:
        if line.startswith("[LOG]:"):
            # A "[LOG]:" line starts a new block and carries the array size
            close_block()
            array_size = line[line.find("size ") + len("size ") :].strip()
        elif array_size is not None:
            # Drop the trailing newline so the last field is clean
            _, _, elapsed, _, access, num_accesses = line.rstrip("\n").split("\t")
            elapsed_times.append(float(elapsed))
            access_times.append(float(access))
        else:
            print("Error: malformed logfile")
            sys.exit(1)

    close_block()

tabulate(table_rows, headers="firstrow", tablefmt="html")
```


| Array Size | Avg Elapsed Time (s) | Avg Access Time (ns) |
| ---------- | -------------------- | -------------------- |
| 4 KiB      | 0.00134333           | 3.27936              |
| 8 KiB      | 0.00197331           | 2.40878              |
| 16 KiB     | 0.00390036           | 2.38066              |
| 32 KiB     | 0.00777907           | 2.37396              |
| 64 KiB     | 0.0185227            | 2.82638              |
| 128 KiB    | 0.0411921            | 3.14271              |
| 256 KiB    | 0.0912179            | 3.47969              |
| 512 KiB    | 0.190617             | 3.63574              |
| 1024 KiB   | 0.379658             | 3.6207               |
| 2048 KiB   | 0.733716             | 3.49863              |


| **Array Size**         | 4 KiB | 8 KiB | 16 KiB | 32 KiB | 64 KiB | 128 KiB |
| ---------------------- | ----- | ----- | ------ | ------ | ------ | ------- |
| **t2-t1**              |       |       |        |        |        |         |
| **# accesses a[i]**    |       |       |        |        |        |         |
| **# mean access time** |       |       |        |        |        |         |



## 2.2

The cache size appears to be 64 KiB: beyond that array size, read and write times grow disproportionately. There is a clear jump between 64 KiB and 128 KiB, caused by an increase in capacity misses.

## 2.3

text here

## 2.4

text here

## 3.1

### 3.1.1

#### a)

During the program's execution, the events being counted are L1 data cache misses: the counter is triggered every time an access misses in the L1 data cache.

#### b)

![Plot](assets/3.1.1-b.png)

Table: the values are in `cm1.out`; they will be copied here later.

#### c)

##### L1 size:

From 32 KiB upwards, the average miss rate rises well above its previous level. We can therefore assume that the whole array used to fit in the L1 cache and no longer does: as the array size grows past that value, misses become far more frequent.

##### Block size:

The block size is directly related to the stride used by the algorithm, since the stride tells us how many words we skip between accesses. Here a "word" is a `uint8_t`, i.e. one byte, so a stride of $n$ skips $n$ bytes. As the plot shows, for the 64 KiB array size the miss rate increases steadily with the stride, reaching 1 at a stride of 64 B and staying there up to 4 KiB. This points to a block size of 64 B: with a 64-byte stride every access falls in a different block, so the requested byte is guaranteed not to be in the cache and a new block has to be loaded each time. For strides of 8, 16 and 32 bytes the miss rate is 12.5%, 25% and 50%, doubling every time the stride doubles: if we advance 8 bytes at a time with a 64 B block, we have to load a new block once every 8 accesses, and so on.
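The reasoning above can be captured in a tiny model. The sketch below is only an illustration: the function name `expected_miss_rate` is ours, and it assumes a 64 B block and an array too large to stay in the cache between passes, so every newly touched block is a miss.

```python
def expected_miss_rate(stride_bytes: int, block_bytes: int = 64) -> float:
    """Fraction of accesses that land in a new block for a fixed-stride walk.

    Assumes no block survives in the cache from one pass to the next, so the
    first access to each block always misses.
    """
    # One miss per block, and block_bytes / stride_bytes accesses per block;
    # once the stride reaches the block size, every access misses.
    return min(1.0, stride_bytes / block_bytes)


for stride in (1, 2, 4, 8, 16, 32, 64, 128):
    print(f"stride {stride:>3} B -> miss rate {expected_miss_rate(stride):.4f}")
```

Under these assumptions the model gives 12.5%, 25% and 50% for strides of 8, 16 and 32 bytes, and saturates at 100% from 64 bytes onwards, which is the pattern described for the measured curve.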

##### Associativity set size:

text here

### 3.1.2

#### a)

We changed the tracked event to `PAPI_L2_DCM`, so as to count L2 data cache misses, and set `CACHE_MIN` and `CACHE_MAX` to 64 KiB and 1 MiB respectively, so that the sweep covers the L2 cache. The 64 KiB lower bound was chosen deliberately to start one power of 2 above the expected L1 cache size, since the L2 cache is always expected to be larger than the L1 cache.

#### b)

![Plot](assets/3.1.2-b.png)

#### c)

##### L2 size:

For the same reasons given in the L1 size section of 3.1.1 c), the L2 cache size appears to be 256 KiB: once the array size exceeds that value, the miss rate grows disproportionately, because the array no longer fits in the cache as a whole and misses start to occur.

##### Block size:

For the same reasons given in the Block size section of 3.1.1 c), the L2 cache's block size also appears to be 64 B.

##### Associativity set size:

text here

## 3.2

### 3.2.1

#### a)

We have two $512 \times 512$ `uint16_t` matrices: each matrix occupies $512^2 \times 2 = 2^{19}$ Bytes = $512$ KB, so the two of them combined occupy $2^{20}$ Bytes, or $1$ MB, in memory.
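As a quick sanity check of the arithmetic above, here is a minimal Python sketch. The names `N`, `ELEMENT_BYTES` and `matrix_bytes` are ours and exist only for this illustration.

```python
N = 512            # matrix dimension
ELEMENT_BYTES = 2  # size of one uint16_t element


def matrix_bytes(n: int, element_bytes: int) -> int:
    """Memory footprint, in bytes, of a dense n x n matrix."""
    return n * n * element_bytes


one_matrix = matrix_bytes(N, ELEMENT_BYTES)
two_matrices = 2 * one_matrix

print(one_matrix, one_matrix == 2**19)      # size of a single matrix
print(two_matrices, two_matrices == 2**20)  # size of both matrices
```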

#### b)

Program output:

```
After resetting counter 'PAPI_L1_DCM' [x10^6]: 0.000000
After resetting counter 'PAPI_LD_INS' [x10^6]: 0.000000
After resetting counter 'PAPI_SR_INS' [x10^6]: 0.000000
After stopping counter 'PAPI_L1_DCM'  [x10^6]: 134.444855
After stopping counter 'PAPI_LD_INS'  [x10^6]: 3491.023749
After stopping counter 'PAPI_SR_INS'  [x10^6]: 672.141375
Wall clock cycles [x10^6]: 3995.673182
Wall clock time [seconds]: 1.177878
Matrix checksum: 2717908992
```

| **Total number of L1 data cache misses**                |  134.444855   | **⨯ 10^6**  |
| ------------------------------------------------------- | --- | ----------- |
| **Total number of load / store instructions completed** |  3491.023749 + 672.141375  | **⨯ 10^6**  |
| **Total number of clock cycles**                        |  3995.673182   | **⨯ 10^6**  |
| **Elapsed time**                                        |  1.177878   | **seconds** |


#### c)

$$
\operatorname{HitRate} = 1 - \operatorname{MissRate} = 1 - \frac{\operatorname{Misses}}{\operatorname{Accesses}} = 1 - \frac{134.444855}{3491.023749 + 672.141375} = 0.9677
$$
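The same computation can be written as a small Python helper, which is handy for the later variants as well. This is only a sketch: `hit_rate` is a name we introduce here, and it assumes, as the formula above does, that every completed load or store counts as one L1 data cache access.

```python
def hit_rate(misses: float, loads: float, stores: float) -> float:
    """L1 hit rate, taking loads + stores as the total number of accesses."""
    accesses = loads + stores
    return 1.0 - misses / accesses


# Counter values (in millions) taken from the program output above
mm1_hit_rate = hit_rate(misses=134.444855, loads=3491.023749, stores=672.141375)
print(f"{mm1_hit_rate:.4f}")  # prints 0.9677
```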

### 3.2.2

#### a)

Program output:

```
After resetting counter 'PAPI_L1_DCM' [x10^6]: 0.000000
After resetting counter 'PAPI_LD_INS' [x10^6]: 0.000000
After resetting counter 'PAPI_SR_INS' [x10^6]: 0.000000
After stopping counter 'PAPI_L1_DCM'  [x10^6]: 4.212926
After stopping counter 'PAPI_LD_INS'  [x10^6]: 402.664929
After stopping counter 'PAPI_SR_INS'  [x10^6]: 134.217780
Wall clock cycles [x10^6]: 744.145336
Wall clock time [seconds]: 0.219365
Matrix checksum: 2717908992
```

| **Total number of L1 data cache misses**                |  4.212926   | **⨯ 10^6**  |
| ------------------------------------------------------- | --- | ----------- |
| **Total number of load / store instructions completed** |  402.664929 + 134.217780   | **⨯ 10^6**  |
| **Total number of clock cycles**                        |  744.145336   | **⨯ 10^6**  |
| **Elapsed time**                                        |   0.219365  | **seconds** |

#### b)

$$
\operatorname{HitRate} = 1 - \operatorname{MissRate} = 1 - \frac{\operatorname{Misses}}{\operatorname{Accesses}} = 1 - \frac{4.212926}{402.664929 + 134.217780} = 0.99215
$$

#### c)

Program output:

```
After resetting counter 'PAPI_L1_DCM' [x10^6]: 0.000000
After resetting counter 'PAPI_LD_INS' [x10^6]: 0.000000
After resetting counter 'PAPI_SR_INS' [x10^6]: 0.000000
After stopping counter 'PAPI_L1_DCM'  [x10^6]: 4.484165
After stopping counter 'PAPI_LD_INS'  [x10^6]: 402.925461
After stopping counter 'PAPI_SR_INS'  [x10^6]: 134.479925
Wall clock cycles [x10^6]: 744.901308
Wall clock time [seconds]: 0.219588
Matrix checksum: 2717908992
```

| **Total number of L1 data cache misses**                |  4.484165   | **⨯ 10^6**  |
| ------------------------------------------------------- | --- | ----------- |
| **Total number of load / store instructions completed** |  402.925461 + 134.479925  | **⨯ 10^6**  |
| **Total number of clock cycles**                        | 744.901308    | **⨯ 10^6**  |
| **Elapsed time**                                        | 0.219588    | **seconds** |


All values went up, but only by a slight margin: matrix transposition has quadratic time complexity, which is far smaller than the cubic complexity of the matrix multiplication.
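To put numbers on this, the sketch below compares the operation counts for $N = 512$ with the measured increase in store instructions between a) and c). It assumes that the run in c) differs from the one in a) only by including the transposition in the measured region; the variable names are ours.

```python
N = 512

transpose_ops = N ** 2  # one element written per matrix entry
multiply_ops = N ** 3   # one multiply-accumulate per (i, j, k) triple

# Store instructions (in millions) reported by PAPI_SR_INS in a) and c)
stores_without_transpose = 134.217780
stores_with_transpose = 134.479925
extra_stores = round((stores_with_transpose - stores_without_transpose) * 1e6)

print(transpose_ops, multiply_ops)
print(f"transposition share: {transpose_ops / multiply_ops:.4%}")
print(f"extra stores measured: {extra_stores}")
```

Under that assumption, the extra stores match $N^2$ almost exactly, which is consistent with the transposition accounting for the whole increase.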

#### d)

| **$\Delta\text{HitRate} = \text{HitRate}_{\text{mm2}} - \text{HitRate}_{\text{mm1}}$**         | 0.99215 - 0.9677 = 0.02445 |
| ---------------------------------------------------------------------------------------------- | --- |
| **$\text{Speedup(\#Clocks)} = \frac{\text{Clocks}_{\text{mm1}}}{\text{Clocks}_{\text{mm2}}}$** | $\frac{3995.673182}{744.145336} = 5.369479574$    |
| **$\text{Speedup(Time)} = \frac{\text{Time}_{\text{mm1}}}{\text{Time}_{\text{mm2}}}$**         | $\frac{1.177878}{0.219365} = 5.369489207$    |


The second implementation is clearly worth it: it is about 5.37 times faster than the first, both in clock cycles and in elapsed time.
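The two ratios in the table can be reproduced with a short Python sketch; `speedup` is a helper name we introduce purely for illustration.

```python
def speedup(baseline: float, improved: float) -> float:
    """How many times faster the improved version is than the baseline."""
    return baseline / improved


# Values taken from the mm1 and mm2 program outputs above
clock_speedup = speedup(3995.673182, 744.145336)  # clock cycles, in millions
time_speedup = speedup(1.177878, 0.219365)        # wall clock time, in seconds

print(f"{clock_speedup:.2f} {time_speedup:.2f}")  # prints 5.37 5.37
```

Both ratios agree, as expected when the two measurements are taken on the same machine at the same clock frequency.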

### 3.2.3

#### a)

Each element occupies $2$ Bytes, since they are of type `uint16_t`. Therefore, since each line takes $64$ bytes (considering the value gathered in 3.1), the number of elements per line is $\frac{64}{2} = 32$.

#### b)

Program output:

```
After resetting counter 'PAPI_L1_DCM' [x10^6]: 0.000000
After resetting counter 'PAPI_LD_INS' [x10^6]: 0.000000
After resetting counter 'PAPI_SR_INS' [x10^6]: 0.000000
After stopping counter 'PAPI_L1_DCM'  [x10^6]: 5.810141
After stopping counter 'PAPI_LD_INS'  [x10^6]: 402.802696
After stopping counter 'PAPI_SR_INS'  [x10^6]: 134.222203
Wall clock cycles [x10^6]: 397.765652
Wall clock time [seconds]: 0.117256
Matrix checksum: 2717908992
```

| **Total number of L1 data cache misses**                |  5.810141   | **⨯ 10^6**  |
| ------------------------------------------------------- | --- | ----------- |
| **Total number of load / store instructions completed** |  402.802696 + 134.222203   | **⨯ 10^6**  |
| **Total number of clock cycles**                        | 397.765652   | **⨯ 10^6**  |
| **Elapsed time**                                        |  0.117256   | **seconds** |


#### c)

$$
\operatorname{HitRate} = 1 - \operatorname{MissRate} = 1 - \frac{\operatorname{Misses}}{\operatorname{Accesses}} = 1 - \frac{5.810141}{402.802696 + 134.222203} = 0.98918
$$

#### d)

| **$\Delta\text{HitRate} = \text{HitRate}_{\text{mm3}} - \text{HitRate}_{\text{mm1}}$**         |  0.98918 - 0.9677 = 0.02148  |
| ---------------------------------------------------------------------------------------------- | --- |
| **$\text{Speedup(\#Clocks)} = \frac{\text{Clocks}_{\text{mm1}}}{\text{Clocks}_{\text{mm3}}}$** |  $\frac{3995.673182}{397.765652} = 10.04529467$   |


This new implementation, which exploits spatial locality in the matrix accesses, is far more efficient than the original "naive" one, with a speedup of around $10$ times.

#### e)

| **$\Delta\text{HitRate} = \text{HitRate}_{\text{mm3}} - \text{HitRate}_{\text{mm2}}$**         |  0.98918 - 0.99215 = −0.00297  |
| ---------------------------------------------------------------------------------------------- | --- |
| **$\text{Speedup(\#Clocks)} = \frac{\text{Clocks}_{\text{mm2}}}{\text{Clocks}_{\text{mm3}}}$** | $\frac{744.145336}{397.765652} = 1.870813461$    |


This new implementation outperforms the transposition one by a factor of about 1.87. It shows that, here, exploiting the cache's spatial locality is more efficient than relying on a supposed algorithmic advantage, which might not even work well for very large matrices.

As the question statement suggested, we also measured the L2 miss events for both programs:

- for the transposition one, we found $4.473961 \times 10^6$ L2 misses;
- for the last one, we found $0.471593 \times 10^6$ L2 misses.

This is crucial for the program's efficiency, since going to the next level of the memory hierarchy, be it L3 or main memory, is very costly compared with hitting in L1 or L2.

### 3.2.3

text here