# Snapshot Dataset Generation

## Common Preamble

In [17]:
import os, sys
sys.path.append(os.path.join(os.path.abspath(''), '../'))

import peewee as pw
from toyDb.databases import ExperimentDb, ShaderDb
from toyDb.utils.Directory import getToyDbRootDir

import numpy as np
import json
import hashlib

ExperimentDb.init_from_default_db()

### Snapshot format

The snapshot formats are as follows:

~~FragPerfSnapshotDataset~~
  - This is deprecated

FragTokenizedDataset (Consumed by `FragmentMaskedLMDataset`):
- input_ids: tokenizer output of SPIR-V shader representation
- NOTE: These are not truncated, since the DataCollator handles truncation
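
To illustrate why untruncated `input_ids` can be stored, here is a minimal sketch of truncation at collation time. The `collate_truncated` function, its `pad_id` default, and the example token IDs are hypothetical; this is not the project's actual collator.

```python
def collate_truncated(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one.

    Hypothetical stand-in for a data collator; expects a non-empty batch
    of token-id lists.
    """
    truncated = [ids[:max_length] for ids in batch]
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    attention_mask = [
        [1] * len(ids) + [0] * (width - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids: the first sequence is cut to 4 tokens, the second is padded
batch = collate_truncated([[5, 6, 7, 8, 9], [5, 6]], max_length=4)
print(batch["input_ids"])  # [[5, 6, 7, 8], [5, 6, 0, 0]]
```

Deferring truncation this way keeps the stored snapshot independent of the maximum length chosen at training time.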

FragPerfSnapshotTracedDataset:
- "environmentId": self.environmentId
- "shaderId": expr.shader.shader_id
- "fragSpv": expr.shader.fragment_spv
  - `SPIR-V bytes`
- "traceFragSpv": expr.trace.traced_fragment_spv
  - `SPIR-V bytes`
- "timeMean": result_mean
  - `float`
- "bbIdxMap": {int(k): v for k, v in json.loads(expr.trace.bb_idx_map).items()}
  - `dict[int, int]`
- "bbTraceCounters": json.loads(expr.trace.bb_trace_counters)
  - `List[int]`
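
JSON object keys are always strings, which is why `bbIdxMap` is rebuilt with integer keys after `json.loads`. The following sketch assembles one record from made-up values; the `build_record` helper and all sample data are hypothetical and only mirror the field list above.

```python
import json


def build_record(environment_id, shader_id, frag_spv, traced_frag_spv,
                 time_mean, bb_idx_map_json, bb_trace_counters_json):
    """Assemble one snapshot record (hypothetical helper, illustrative only)."""
    return {
        "environmentId": environment_id,
        "shaderId": shader_id,
        "fragSpv": frag_spv,
        "traceFragSpv": traced_frag_spv,
        "timeMean": float(time_mean),
        # JSON keys come back as strings, so restore the integer keys
        "bbIdxMap": {int(k): v for k, v in json.loads(bb_idx_map_json).items()},
        "bbTraceCounters": json.loads(bb_trace_counters_json),
    }


# Made-up example values; the byte strings are placeholders, not real shaders
record = build_record(
    1, "exampleId", b"\x03\x02\x23\x07", b"\x03\x02\x23\x07",
    0.0012, '{"5": 0, "9": 1}', "[120, 64]",
)
print(record["bbIdxMap"])  # {5: 0, 9: 1}
```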


### Multi-environment considerations

The current flow is to append an environment suffix (e.g. `-3060`) to the json / dat filename.
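
A minimal sketch of this naming scheme, assuming the pattern `<dataset name><max length>-<environment suffix>.<extension>` used by the 3060 snapshot in this notebook. The `snapshot_path` helper is hypothetical.

```python
import os


def snapshot_path(base_dir, dataset_name, max_len, env_suffix, ext="dat"):
    """Build '<dataset_name><max_len>-<env_suffix>.<ext>' under base_dir."""
    return os.path.join(base_dir, f"{dataset_name}{max_len}-{env_suffix}.{ext}")


path = snapshot_path(
    "intermediates", "FragPerfSnapshotTracedDataset", 4096, "3060"
)
print(os.path.basename(path))  # FragPerfSnapshotTracedDataset4096-3060.dat
```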

### Persistence considerations

Each snapshot file is recorded together with its MD5 hash so that a given file can be identified later. The hash is computed in chunks as follows:

```python
with open("your_filename.dat", "rb") as f:
    file_hash = hashlib.md5()
    chunk = f.read(8192)
    while chunk:
        file_hash.update(chunk)
        chunk = f.read(8192)

print(file_hash.hexdigest())
```
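
The same chunked hashing can be wrapped in a function and used to check a snapshot against a previously recorded digest. This is a sketch only: `md5_of_file` and `verify_snapshot` are hypothetical helper names, not part of the project.

```python
import hashlib


def md5_of_file(path, chunk_size=8192):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns empty bytes
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_snapshot(path, expected_md5):
    """True if the file's MD5 digest matches the recorded one."""
    return md5_of_file(path) == expected_md5.lower()
```

MD5 is adequate here for spotting accidental changes or mix-ups between files; it is not suitable for protecting against deliberate tampering.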

### Shader blacklist

This is stored in `shaderBlacklist.json`.

The reason for each entry is given below:
- `fl2yzG`: The compiled SPIR-V has too many IDs
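
The layout of `shaderBlacklist.json` is not shown here. The sketch below assumes it is either a JSON list of shader IDs or a JSON object keyed by shader ID; the `load_blacklist` and `filter_blacklisted` helpers are hypothetical.

```python
import json


def load_blacklist(path):
    """Return the set of blacklisted shader IDs.

    Assumption: the file holds a JSON list of IDs or a JSON object whose
    keys are IDs. The real layout of shaderBlacklist.json may differ.
    """
    with open(path, "r", encoding="utf-8") as f:
        # Iterating a dict yields its keys; iterating a list yields its items
        return set(json.load(f))


def filter_blacklisted(shader_ids, blacklist):
    """Keep only the shader IDs that are not blacklisted, preserving order."""
    return [sid for sid in shader_ids if sid not in blacklist]
```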

## 3060 Dataset Generation

Environment details:
- libreliu-GCL-Arch
- Linux 6.1.58-1-lts
- Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz
- NVIDIA GeForce RTX 3060
- NVIDIA 535.113.01
- gfxclk1605-memclk7500

The results are saved to
- FragPerfSnapshotTracedDataset4096-3060.dat

> NOTE: `FragPerfSnapshotDataset4096-3060.json` is deprecated.
> The non-traced version should also consume the traced dataset.

In [18]:
from dataset.FragmentPerformanceWithTraceDataset import FragmentPerformanceWithTraceDataset
from dataset.FragmentPerformanceTracedSnapshotDataset import (
    FragmentPerformanceTracedSnapshotDataset
)

In [19]:
selectedEnv = ExperimentDb.Environment.select()[0]
print(f"Environment selected: \n"
      f"- {selectedEnv.node}\n"
      f"- {selectedEnv.cpu}\n"
      f"- {selectedEnv.gpu}\n"
      f"- {selectedEnv.gpu_driver}\n"
      f"- {selectedEnv.comment}"
)

Environment selected: 
- libreliu-GCL-Arch
-  Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz
- NVIDIA GeForce RTX 3060
- NVIDIA 535.113.01
- gfxclk1605-memclk7500


In [20]:
traceDataset = FragmentPerformanceWithTraceDataset(
  environmentId=selectedEnv.id,
  filteredNumCycles=ExperimentDb.CANONICAL_NUM_CYCLES,
  filteredNumTrials=ExperimentDb.CANONICAL_NUM_TRIALS,
  useBlackList=True
)

print(f"Length of the traceDataset: {len(traceDataset)}")

Length of the traceDataset: 13867


Generation of FragPerfSnapshotTracedDataset4096-3060.dat

In [21]:
import misc.snapshotFragPerfTraceDataset
from misc.Directory import (
  getIntermediateDir,
  getVkPredictRootDir
)
import misc.TokenizerBuilder
import logging

logging.basicConfig(
  level=logging.INFO,
  format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

destFile = os.path.join(
  getIntermediateDir(),
  "./FragPerfSnapshotTracedDataset4096-3060.dat"
)

OVERRIDE_DATASET_SNAPSHOT_FILE = False
DO_SNAPSHOT = True
if os.path.isfile(destFile):
  print(f"{destFile} already exists; regenerating the snapshot would overwrite it.")
  print("Set OVERRIDE_DATASET_SNAPSHOT_FILE to True to regenerate it.")

  if OVERRIDE_DATASET_SNAPSHOT_FILE:
    DO_SNAPSHOT = True
  else:
    DO_SNAPSHOT = False

if DO_SNAPSHOT:
  misc.snapshotFragPerfTraceDataset.snapshot(
    # train ratio
    0.8,
    # output dir
    destFile,
    # max tokenized length
    4096,
    # tokenizer
    misc.TokenizerBuilder.build_tokenizer("HfTracedSpvTokenizer-multiple-entrypoint"),
    # dset args
    {
      "environmentId": selectedEnv.id,
      "filteredNumCycles": ExperimentDb.CANONICAL_NUM_CYCLES,
      "filteredNumTrials": ExperimentDb.CANONICAL_NUM_TRIALS
    }
  )



2023-11-01 23:22:35,040 - misc.snapshotFragPerfTraceDataset - INFO - maxTokenizedLength == 4096
100%|██████████| 13867/13867 [01:36<00:00, 144.00it/s]
2023-11-01 23:24:11,354 - misc.snapshotFragPerfTraceDataset - INFO - Total: 13867, filtered: 11274, (81.301% of the original) train: 9019, test: 2255


Failed samples (total 0): []


Training sample serialization: 100%|██████████| 9019/9019 [00:00<00:00, 48821.96it/s]
Testing sample serialization: 100%|██████████| 2255/2255 [00:00<00:00, 47581.50it/s]
2023-11-01 23:24:11,700 - misc.snapshotFragPerfTraceDataset - INFO - Snapshot written to /home/libreliu/Projects/NGPP/vkPredict/misc/.././intermediates/./FragPerfSnapshotTracedDataset4096-3060.dat


Then we record an MD5 checksum for the dataset:

In [22]:
with open(destFile, "rb") as f:
    file_hash = hashlib.md5()
    chunk = f.read(8192)
    while chunk:
        file_hash.update(chunk)
        chunk = f.read(8192)

print(f"Hash for {destFile}:\n- md5sum: {file_hash.hexdigest()}")

Hash for /home/libreliu/Projects/NGPP/vkPredict/misc/.././intermediates/./FragPerfSnapshotTracedDataset4096-3060.dat:
- md5sum: 4c48549fb2b0a11237fde4592e3b5335


Then we read it back to check the sample counts:

In [24]:
from dataset.FragmentPerformanceTracedSnapshotDataset import FragmentPerformanceTracedSnapshotDataset

trainDataset = FragmentPerformanceTracedSnapshotDataset(destFile, 'train')
testDataset = FragmentPerformanceTracedSnapshotDataset(destFile, 'test')
totalLen = len(trainDataset) + len(testDataset)

print(f"Train dataset: {len(trainDataset)}")
print(f"Test dataset: {len(testDataset)}")
print(f"Total: {totalLen} - {totalLen / len(traceDataset) * 100}% of the original")

Train dataset: 9019
Test dataset: 2255
Total: 11274 - 81.30093026609937% of the original
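
These counts can be cross-checked by hand from the numbers logged during the snapshot. The sketch below assumes the train split is the 0.8 ratio rounded down, which agrees with the logged counts; the snapshot code itself may compute the split differently.

```python
total, kept, train_ratio = 13867, 11274, 0.8

train = int(kept * train_ratio)  # assumption: the split rounds down
test = kept - train
kept_pct = kept / total * 100

print(train, test, round(kept_pct, 3))  # 9019 2255 81.301
```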
