# Querying the public gReLU model zoo on Weights and Biases (wandb)

This tutorial shows how to programmatically query our public model zoo and download models and datasets. You can also visit the model zoo in your browser at https://wandb.ai/grelu/. 

## Rules

- wandb projects are the main storage units for datasets and the models trained on them. The guiding principle is to always preserve the links between the raw dataset, the preprocessed dataset, and the models trained on them, for reproducibility, documentation, and sanity checking.

- The ideal wandb lineage is shown in the figure below. This lineage allows us to query project-model-dataset links via the API.

- Each project contains a notebook describing the details of data preprocessing, model training, and model testing (e.g. performance metrics on held-out data). For models trained by us, the training logs are also available on the model zoo website.

![image.png](lineage.png)
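One way to picture the lineage described above is as a small directed graph from the raw dataset to the preprocessed dataset to the model. The sketch below is only an illustration: the node names and both helper functions are hypothetical and are not part of gReLU or wandb.

```python
# Hypothetical lineage for one project: each artifact maps to the
# artifacts derived directly from it.
LINEAGE = {
    "raw_dataset": ["preprocessed_dataset"],
    "preprocessed_dataset": ["model"],
    "model": [],
}


def downstream(lineage, start):
    """Return every artifact derived from `start`, in breadth-first order."""
    seen = []
    queue = list(lineage.get(start, []))
    while queue:
        node = queue.pop(0)
        if node not in seen:
            seen.append(node)
            queue.extend(lineage.get(node, []))
    return seen


def upstream(lineage, target):
    """Return the artifacts `target` was derived from, nearest first.

    Assumes each artifact has at most one direct parent.
    """
    parents = {
        child: parent
        for parent, children in lineage.items()
        for child in children
    }
    chain = []
    while target in parents:
        target = parents[target]
        chain.append(target)
    return chain


print(downstream(LINEAGE, "raw_dataset"))
print(upstream(LINEAGE, "model"))
```

Keeping these links explicit is what makes it possible to go from a model back to the data it was trained on, and from a dataset forward to the models that used it.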

In [2]:
import os
import anndata
import grelu.resources


## List all available projects in the zoo

The `grelu.resources` module contains functions for interacting with the model zoo. First, we can list all available projects in the zoo:

In [3]:
grelu.resources.projects()

['GM12878_dnase',
 'demo',
 'human-mpra-agrawal-2023',
 'binary_atac_cell_lines',
 'model-zoo-test',
 'alzheimers-variant-tutorial',
 'microglia-scatac-tutorial',
 'human-chromhmm-fullstack',
 'human-atac-catlas',
 'borzoi',
 'corces-microglia-scatac',
 'yeast-gpra',
 'enformer']

We will work with the 'human-atac-catlas' project for the rest of this tutorial.

## List all datasets and models in a project

In [4]:
project_name = 'human-atac-catlas'

Individual objects such as datasets and models are stored as 'artifacts' under each project. Artifacts can be of different types, but the ones that we are generally interested in are "dataset" (the preprocessed dataset) and "model" (the trained model). We can search for these under the project of interest:

In [5]:
grelu.resources.artifacts(project_name, type_is="dataset")

['dataset']

This tells us that there is an artifact called "dataset" which is of the "dataset" type.

In [6]:
grelu.resources.artifacts(project_name, type_is="model")

['model']

This tells us that there is an artifact called "model" which is of the "model" type.
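The two queries above can be combined into a single summary per project. The sketch below is a minimal illustration: `artifacts_by_type` is a hypothetical helper, and `fake_artifacts` is a stand-in that mimics the outputs shown above so that the example runs without network access.

```python
def artifacts_by_type(list_artifacts, project, types=("dataset", "model")):
    """Map each artifact type to the artifact names found in `project`.

    `list_artifacts` is any callable accepting (project, type_is=...),
    which is how grelu.resources.artifacts is called in this tutorial.
    """
    return {t: list(list_artifacts(project, type_is=t)) for t in types}


def fake_artifacts(project, type_is):
    """Offline stand-in returning the names shown in the outputs above."""
    catalog = {"dataset": ["dataset"], "model": ["model"]}
    return catalog[type_is]


summary = artifacts_by_type(fake_artifacts, "human-atac-catlas")
print(summary)
```

In a live session, passing `grelu.resources.artifacts` in place of `fake_artifacts` should give the same kind of summary for the real project.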

## Download a dataset

Let us now select the "dataset" artifact.

In [7]:
artifact = grelu.resources.get_artifact(
    name="dataset",
    project = project_name,
)
artifact

<Artifact QXJ0aWZhY3Q6ODUwODcxODM0>

We can download this artifact into a local directory.

In [8]:
artifact_dir = artifact.download()
artifact_dir

wandb: Downloading large artifact dataset:latest, 202.72MB. 1 files...
wandb:   1 of 1 files downloaded.
Done. 0:0:0.5


'/data/yulai/projects/RLfinetuning_Diffusion_Bioseq/artifacts/dataset:v1'
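Note that the download log refers to `dataset:latest`, while the returned directory ends in `dataset:v1`: the `latest` alias was resolved to a concrete version. If you want to record which version you actually downloaded, the directory name can be split as in the sketch below. `split_artifact_dir` is a hypothetical helper, and it assumes the `name:version` directory naming seen in the output above.

```python
import os


def split_artifact_dir(path):
    """Split a download directory such as '.../dataset:v1' into (name, version)."""
    base = os.path.basename(os.path.normpath(path))
    name, sep, version = base.rpartition(":")
    if not sep:  # no ':' in the directory name, so no version suffix
        return base, None
    return name, version


print(split_artifact_dir("artifacts/dataset:v1"))
```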

We can list the files in this directory:

In [9]:
os.listdir(artifact_dir)

['preprocessed.h5ad']
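For artifacts that contain several files or nested folders, it can help to list file sizes as well. The following is a minimal sketch using only the standard library; `list_files_with_sizes` is a hypothetical helper, not part of gReLU.

```python
from pathlib import Path


def list_files_with_sizes(directory):
    """Return sorted (relative path, size in MB) pairs for all files under `directory`."""
    root = Path(directory)
    return sorted(
        (str(p.relative_to(root)), p.stat().st_size / 1e6)
        for p in root.rglob("*")
        if p.is_file()
    )
```

Calling `list_files_with_sizes(artifact_dir)` after the download step would list every file in the artifact together with its size.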

In [10]:
import pandas as pd
from scipy import sparse

ad = anndata.read_h5ad(os.path.join(artifact_dir, 'preprocessed.h5ad'))

# Export the observation (obs) metadata
ad.obs.to_csv('obs_data.csv')

# Export the variable (var) metadata
ad.var.to_csv('var_data.csv')

# ad.X may be stored as a sparse matrix; convert it to a dense array only if needed
X = ad.X.toarray() if sparse.issparse(ad.X) else ad.X

# Convert the data matrix to a DataFrame
data_matrix_df = pd.DataFrame(X, index=ad.obs.index, columns=ad.var.index)

# Export the data matrix to a CSV file
data_matrix_df.to_csv('data_matrix.csv')

# Optional: write everything to a single Excel workbook instead
# with pd.ExcelWriter('anndata_output.xlsx') as writer:
#     ad.obs.to_excel(writer, sheet_name='Observations')
#     ad.var.to_excel(writer, sheet_name='Variables')
#     data_matrix_df.to_excel(writer, sheet_name='Data Matrix')
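Converting the full data matrix to a dense array can take a lot of memory when the matrix is wide, so it may be worth estimating the size first. The sketch below is plain arithmetic; `dense_size_gb` is a hypothetical helper, and the shape used in the example is made up for illustration rather than taken from this dataset.

```python
def dense_size_gb(n_obs, n_vars, bytes_per_value=4):
    """Approximate memory, in GB (1e9 bytes), of a dense n_obs x n_vars array."""
    return n_obs * n_vars * bytes_per_value / 1e9


# Hypothetical shape, for illustration only:
# 200 rows by 1,000,000 columns of 4-byte values.
print(f"{dense_size_gb(200, 1_000_000):.1f} GB")
```

With the AnnData object loaded above, `dense_size_gb(ad.n_obs, ad.n_vars, ad.X.dtype.itemsize)` gives the in-memory estimate; the CSV file written to disk will generally differ in size.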


In [11]:
df = ad.to_df()  # Converts ad.X to a DataFrame
print(df.head())  # View the first few rows


                         0    1    2    3    4    5    6    7    8    9  ...  \
cell type                                                                ...   
Follicular             1.0  0.0  0.0  0.0  0.0  1.0  1.0  0.0  0.0  0.0  ...   
Fibro General          1.0  0.0  0.0  0.0  0.0  1.0  1.0  0.0  0.0  0.0  ...   
Acinar                 1.0  0.0  0.0  0.0  0.0  1.0  1.0  0.0  0.0  0.0  ...   
T Lymphocyte 1 (CD8+)  1.0  0.0  0.0  0.0  0.0  1.0  1.0  0.0  0.0  0.0  ...   
T lymphocyte 2 (CD4+)  1.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  ...   

                       1121451  1121452  1121453  1121454  1121455  1121456  \
cell type                                                                     
Follicular                 1.0      0.0      0.0      0.0      0.0      0.0   
Fibro General              0.0      0.0      0.0      0.0      0.0      0.0   
Acinar                     1.0      0.0      0.0      0.0      0.0      0.0   
T Lymphocyte 1 (CD8+)      0.0      0.0     

We could download the trained model from the zoo in the same way. However, `grelu.resources.load_model` downloads a model from the zoo and loads it directly into memory in one step.

## One-step downloading and loading a model

In [12]:
model = grelu.resources.load_model(
    project=project_name,
    model_name='model'
) # that's it!

wandb: Downloading large artifact model:latest, 825.03MB. 1 files...
wandb:   1 of 1 files downloaded.
Done. 0:0:0.7
wandb: Downloading large artifact human_state_dict:latest, 939.29MB. 1 files...
wandb:   1 of 1 files downloaded.
Done. 0:0:0.8


In [13]:
model

LightningModel(
  (model): EnformerPretrainedModel(
    (embedding): EnformerTrunk(
      (conv_tower): EnformerConvTower(
        (blocks): ModuleList(
          (0): Sequential(
            (0): Conv1d(4, 768, kernel_size=(15,), stride=(1,), padding=same)
            (1): ConvBlock(
              (norm): Norm(
                (layer): BatchNorm1d(768, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              )
              (conv): Conv1d(768, 768, kernel_size=(1,), stride=(1,), padding=same)
              (act): Activation(
                (layer): GELU()
              )
              (pool): Pool(
                (layer): AttentionPool(
                  (pool_fn): Rearrange('b d (n p) -> b d n p', p=2)
                  (to_attn_logits): Conv2d(768, 768, kernel_size=(1, 1), stride=(1, 1), bias=False)
                )
              )
              (dropout): Dropout(
                (layer): Identity()
              )
              (channel_transform): ChannelT