## Step-by-step instructions to download HEST-1k 

This tutorial shows how to:

- Download HEST-1k in its entirety (scanpy, whole-slide images, patches, nuclear segmentation, alignment preview)
- Download selected samples of HEST-1k by sample ID
- Download samples that share given attributes (e.g., all breast cancer cases)
- Inspect freshly downloaded samples


## Instructions for Setting Up HuggingFace Account and Token

### 1. Create an Account on HuggingFace
Follow the instructions provided on the [HuggingFace sign-up page](https://huggingface.co/join).

### 2. Accept the HEST terms of use

1. Go to the [HEST HuggingFace page](https://huggingface.co/datasets/MahmoodLab/hest)
2. Request access (access will be automatically granted)
3. At this stage, you can already inspect the data manually by browsing the `Files and versions` tab

### 3. Create a Hugging Face Token

1. **Go to Settings:** Navigate to your profile settings by clicking on your profile picture in the top right corner and selecting `Settings` from the dropdown menu.

2. **Access Tokens:** In the settings menu, find and click on `Access tokens`.

3. **Create New Token:**
   - Click on `New token`.
   - Set the token name (e.g., `hest`).
   - Set the access level to `Write`.
   - Click on `Create`.

4. **Copy Token:** After the token is created, copy it to your clipboard. You will need this token for authentication.

### 4. Log in

Install the Python library `datasets` and run the cells below. If the login succeeds, you should see:

```
Your token has been saved to /home/usr/.cache/huggingface/token
Login successful
```

In [None]:
%%bash
pip install datasets

In [None]:
from huggingface_hub import login

login(token="YOUR HUGGING FACE TOKEN")
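
Pasting a token directly into a notebook makes it easy to leak through version control. As a minimal sketch, the token can instead be read from an environment variable. The variable name `HF_TOKEN` and the helpers `read_hf_token` and `login_from_env` are assumptions made for illustration; they are not part of the HEST tutorial.

```python
import os


def read_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in `env_var`, or None if it is unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None


def login_from_env(env_var="HF_TOKEN"):
    """Log in to the Hugging Face Hub with a token taken from the environment."""
    token = read_hf_token(env_var)
    if token is None:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    # Imported lazily so read_hf_token works without huggingface_hub installed.
    from huggingface_hub import login

    login(token=token)
```

After exporting the variable in your shell, `login_from_env()` can be called in place of the `login(token=...)` cell.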

### Download HEST-1k

In [None]:
import datasets

local_dir='../hest_data' # hest will be downloaded to this folder

# Note that the full dataset is around 1TB of data
dataset = datasets.load_dataset(
    'MahmoodLab/hest', 
    cache_dir=local_dir,
    patterns='*'
)
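
Because the full dataset is around 1TB, it can be worth checking free disk space before starting the download. The sketch below uses only the standard library; the helper names `free_bytes` and `has_enough_space` and the `10**12` byte threshold are illustrative assumptions.

```python
import shutil
from pathlib import Path


def free_bytes(path):
    """Free bytes on the filesystem that holds `path` (which may not exist yet)."""
    p = Path(path).resolve()
    # Walk up to the nearest existing directory; the root always exists.
    while not p.exists():
        p = p.parent
    return shutil.disk_usage(p).free


def has_enough_space(path, required_bytes):
    """True if at least `required_bytes` are free where `path` will live."""
    return free_bytes(path) >= required_bytes


local_dir = "../hest_data"
required = 10**12  # roughly 1TB, an assumed threshold
if has_enough_space(local_dir, required):
    print("Enough free space for the full download.")
else:
    print(f"Only {free_bytes(local_dir) / 1e9:.0f} GB free; consider a subset.")
```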

### Download HEST-1k based on sample IDs

In [None]:
import datasets

local_dir='../hest_data' # hest will be downloaded to this folder

ids_to_query = ['TENX95', 'TENX99'] # list of ids to query

list_patterns = [f"*{sample_id}[_.]**" for sample_id in ids_to_query]
dataset = datasets.load_dataset(
    'MahmoodLab/hest', 
    cache_dir=local_dir,
    patterns=list_patterns
)
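
The `[_.]` in each pattern requires the sample ID to be followed by an underscore or a dot, which appears intended to keep one ID from also selecting longer IDs that start with the same characters. The sketch below illustrates the idea with Python's `fnmatch`. The file paths are hypothetical, and the Hub's own glob matching may differ in details from `fnmatch`, so treat this as an illustration rather than a description of how `datasets` resolves patterns.

```python
from fnmatch import fnmatchcase


def build_patterns(sample_ids):
    """Build one glob pattern per sample ID, as in the cell above."""
    return [f"*{sample_id}[_.]**" for sample_id in sample_ids]


def matches_any(path, patterns):
    """True if `path` matches at least one of the glob patterns."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


# Hypothetical file paths, for illustration only.
example_paths = [
    "st/TENX95.h5ad",
    "tissue_seg/TENX95_mask.jpg",
    "st/TENX950.h5ad",
    "st/TENX99.h5ad",
]

patterns = build_patterns(["TENX95"])
for path in example_paths:
    print(path, matches_any(path, patterns))
```

Only the first two paths match: in the third, the ID is followed by `0` rather than `_` or `.`, and the fourth belongs to a different sample.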

### Download HEST-1k based on metadata keys (e.g., organ, technology, oncotree code)

In [None]:
import datasets
import pandas as pd

local_dir='../hest_data' # hest will be downloaded to this folder

meta_df = pd.read_csv("hf://datasets/MahmoodLab/hest/HEST_v1_0_2.csv")

# Filter the dataframe by organ, oncotree code...
meta_df = meta_df[meta_df['oncotree_code'] == 'IDC']
meta_df = meta_df[meta_df['organ'] == 'Breast']

ids_to_query = meta_df['id'].values

list_patterns = [f"*{sample_id}[_.]**" for sample_id in ids_to_query]
dataset = datasets.load_dataset(
    'MahmoodLab/hest', 
    cache_dir=local_dir,
    patterns=list_patterns
)
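
The two successive filters in the cell above can also be written as a single boolean mask. Here is a minimal sketch on a small made-up table: the rows are invented for illustration, only the column names `id`, `organ`, and `oncotree_code` come from the code above, and `select_ids` is a hypothetical helper.

```python
import pandas as pd

# Toy metadata; these rows are invented and do not come from HEST.
toy_meta = pd.DataFrame(
    {
        "id": ["S1", "S2", "S3", "S4"],
        "organ": ["Breast", "Breast", "Liver", "Breast"],
        "oncotree_code": ["IDC", "ILC", "HCC", "IDC"],
    }
)


def select_ids(meta_df, **criteria):
    """Return the IDs of rows whose columns equal every given value."""
    mask = pd.Series(True, index=meta_df.index)
    for column, value in criteria.items():
        mask &= meta_df[column] == value
    return meta_df.loc[mask, "id"].tolist()


print(select_ids(toy_meta, organ="Breast", oncotree_code="IDC"))  # ['S1', 'S4']
```

The returned list can be passed on as `ids_to_query` in the same way as the array produced by `meta_df['id'].values`.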

### Inspect freshly downloaded samples

For each sample, we provide:

- **wsis/**: H&E-stained whole slide images in pyramidal Generic TIFF (or pyramidal Generic BigTIFF if >4.1GB)
- **st/**: Spatial transcriptomics expression data in a scanpy `.h5ad` object
- **metadata/**: Metadata
- **spatial_plots/**: Overlay of the WSI with the ST spots
- **thumbnails/**: Downscaled version of the WSI
- **tissue_seg/**: Tissue segmentation masks:
    - `{id}_mask.jpg`: Downscaled or full resolution greyscale tissue mask
    - `{id}_mask.pkl`: Tissue/holes contours in a pickle file
    - `{id}_vis.jpg`: Visualization of the tissue mask on the downscaled WSI
- **pixel_size_vis/**: Visualization of the pixel size
- **patches/**: 256x256 H&E patches (0.5µm/px) extracted around ST spots in a .h5 object optimized for deep-learning. Each patch is matched to the corresponding ST profile (see **st/**) with a barcode.
- **patches_vis/**: Visualization of the mask and patches on a downscaled WSI.
- **transcripts/**: Individual transcripts aligned to H&E for Xenium samples. Read them with `pandas.read_parquet`; the aligned pixel coordinates are in columns `['he_x', 'he_y']`.
- **cellvit_seg/**: CellViT nuclei segmentation
- **xenium_seg/**: Xenium segmentation on DAPI, aligned to H&E
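
To check what a download actually contains, the folders listed above can be counted with the standard library. This sketch assumes the folders sit directly under the download directory, which may not hold for every download method; `count_files_per_folder` is a hypothetical helper name.

```python
from pathlib import Path

# Folder names taken from the list above.
EXPECTED_FOLDERS = [
    "wsis", "st", "metadata", "spatial_plots", "thumbnails", "tissue_seg",
    "pixel_size_vis", "patches", "patches_vis", "transcripts",
    "cellvit_seg", "xenium_seg",
]


def count_files_per_folder(root, folders=EXPECTED_FOLDERS):
    """Map each expected folder name to the number of files found under it."""
    root = Path(root)
    counts = {}
    for name in folders:
        folder = root / name
        if folder.is_dir():
            counts[name] = sum(1 for p in folder.rglob("*") if p.is_file())
        else:
            counts[name] = 0  # folder missing, e.g. no Xenium samples downloaded
    return counts


for name, n_files in count_files_per_folder("../hest_data").items():
    print(f"{name:15s} {n_files}")
```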


In [None]:
from hest import iter_hest
import pandas as pd

# Ex: inspect all the Invasive Lobular Carcinoma samples (ILC)
meta_df = pd.read_csv('../assets/HEST_v1_1_0.csv')

id_list = meta_df[meta_df['oncotree_code'] == 'ILC']['id'].values

print('load hest...')
# Iterate through a subset of hest
for st in iter_hest('../hest_data', id_list=id_list):
    print(st)