# Notes

## Dataloader (pretraining)
* [HyenaDNA HG38 dataloader](https://github.com/HazyResearch/hyena-dna/blob/main/src/dataloaders/datasets/hg38_dataset.py)

## Tokenizer?
* TODO: check HyenaDNA; I think their Jupyter notebook contains some of their tokenizer code
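
Until the HyenaDNA code has been checked, a character-level scheme (one token per base) is a plausible starting point. The sketch below is an assumption and is not taken from HyenaDNA: the vocabulary, the special-token names, the id assignment, and the `encode`/`decode` helpers are all hypothetical.

```python
# Hypothetical character-level DNA tokenizer (NOT HyenaDNA's implementation).
# The vocabulary and id order below are assumptions chosen for illustration.
VOCAB = ["[PAD]", "[UNK]", "A", "C", "G", "T", "N"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
UNK_ID = TOKEN_TO_ID["[UNK]"]


def encode(sequence: str) -> list[int]:
    """Map each base to an integer id; lowercase (soft-masked) bases are upper-cased first."""
    return [TOKEN_TO_ID.get(base, UNK_ID) for base in sequence.upper()]


def decode(ids: list[int]) -> str:
    """Inverse of encode; special tokens are rendered with their bracketed names."""
    return "".join(VOCAB[i] for i in ids)
```

With this vocabulary, `encode("ACGTN")` gives `[2, 3, 4, 5, 6]`, and any character outside `ACGTN` maps to the `[UNK]` id.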

## Model
* [Original MAMBA repo](https://github.com/state-spaces/mamba)
    * [benchmark_generation_mamba_simple.py](https://github.com/state-spaces/mamba/blob/main/benchmarks/benchmark_generation_mamba_simple.py)
    * Uses [MambaLMHeadModel](https://github.com/state-spaces/mamba/blob/bae8d1a42fec58f4cdd300bf3b987d05eab22ed0/mamba_ssm/models/mixer_seq_simple.py#L173) from `mixer_seq_simple.py`
    * Uses [MixerModel](https://github.com/state-spaces/mamba/blob/bae8d1a42fec58f4cdd300bf3b987d05eab22ed0/mamba_ssm/models/mixer_seq_simple.py#L83)
    * Uses [create_block](https://github.com/state-spaces/mamba/blob/bae8d1a42fec58f4cdd300bf3b987d05eab22ed0/mamba_ssm/models/mixer_seq_simple.py#L21)
    * Uses [Block](https://github.com/state-spaces/mamba/blob/bae8d1a42fec58f4cdd300bf3b987d05eab22ed0/mamba_ssm/modules/mamba_simple.py#L298) and [Mamba](https://github.com/state-spaces/mamba/blob/bae8d1a42fec58f4cdd300bf3b987d05eab22ed0/mamba_ssm/modules/mamba_simple.py#L34) classes (from `mamba_simple.py`)
        * Actual MAMBA operation: [mamba_inner_fn](https://github.com/state-spaces/mamba/blob/bae8d1a42fec58f4cdd300bf3b987d05eab22ed0/mamba_ssm/ops/selective_scan_interface.py#L155)
* [SimplerMambaSSM Jupyter Notebook](./SimplerMambaSSM.ipynb)
    * Uses the `mamba-ssm` library
    * See the `BigNeuralNetwork` class
* [MAMBA chat](https://github.com/havenhq/mamba-chat/blob/main/train_mamba.py)

In [14]:
from zipfile import ZipFile
from io import BytesIO
import requests
import os
from pathlib import Path

import torch

# Download genetic data

In [20]:
# datasets
hg38_url = 'https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_000001405.40/download'
t2t_url = 'https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_009914755.1/download'
dataset_url = hg38_url  # switch to t2t_url to download the T2T assembly instead

print("download started...")
response = requests.get(dataset_url, params={'include_annotation_type': 'GENOME_FASTA'})
response.raise_for_status()  # fail loudly instead of silently skipping extraction
data_dir_path = 'dataset'
os.makedirs(data_dir_path, exist_ok=True)
# The API returns a zip archive; unpack it from memory into the data directory
with BytesIO(response.content) as zip_buffer, ZipFile(zip_buffer, 'r') as archive:
    archive.extractall(path=data_dir_path)
print("dataset ready")

hg38_fasta = 'dataset/ncbi_dataset/data/GCF_000001405.40/GCF_000001405.40_GRCh38.p14_genomic.fna'

print("FASTA files:")
fpaths = list(Path('dataset').rglob('*.fna'))
for fpath in fpaths:
    print(fpath)

data_path = fpaths[0]

download started...
dataset ready
FASTA files:
dataset/ncbi_dataset/data/GCF_000001405.40/GCF_000001405.40_GRCh38.p14_genomic.fna
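
The FASTA file located above still has to be split into sequences before it can be tokenized. A minimal sketch using only the standard library, assuming a plain (uncompressed) FASTA file; `iter_fasta` is a hypothetical helper name, not part of any library used in this notebook.

```python
from typing import Iterator, Tuple


def iter_fasta(path) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an uncompressed FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)  # sequence lines are concatenated as-is
    if header is not None:
        yield header, "".join(chunks)
```

In this notebook it could be called as `next(iter_fasta(data_path))` to inspect the first record. Each record is assembled fully in memory, so a whole chromosome is held as one string; streaming in fixed-size windows may be preferable for pretraining.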


# Model

In [21]:
print("CUDA is available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
gpu = 4  # index of the GPU to use; must be smaller than the GPU count printed above
torch.cuda.set_device(gpu)
print("Current GPU:", torch.cuda.get_device_name())

CUDA is available: True
GPU count: 8
Current GPU: NVIDIA RTX 6000 Ada Generation
