# S04: Extracting Patch Features

Here we use an image feature extractor to extract deep features from all the patches obtained in step `S03`. The extractor used in this step is [`CONCH`](https://github.com/mahmoodlab/CONCH), a vision-language model pretrained on pathology images.

We use [our enhanced CLAM](./tools/CLAM) to show how to extract patch features.

## 1. Some Notes

### 1.1 Overall Procedure

In this step, the patch coordinates of each slide, stored in an `h5` file (see `ROOT_DIR_FOR_DATA_SAVING/tiles-20x-s256/patches/` on your server), are loaded and used to locate the patch regions in the slide image (at the magnification you specified in step `S03`).

Meanwhile, the source file of the slide is also loaded so that the patch regions can be read. Finally, all patch features of the slide are saved in a `pt` file.
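The per-slide loop described above can be sketched schematically as follows. This is only an illustration of the control flow, not CLAM's implementation: `batched`, `extract_slide_features`, `read_patch`, and `encode_batch` are hypothetical names, and the toy callables stand in for reading a region from the slide file and running the feature extractor.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Coord = Tuple[int, int]


def batched(items: Sequence, batch_size: int) -> Iterable[Sequence]:
    """Yield consecutive chunks of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def extract_slide_features(
    coords: Sequence[Coord],
    read_patch: Callable[[Coord], object],
    encode_batch: Callable[[List[object]], List[List[float]]],
    batch_size: int = 128,
) -> List[List[float]]:
    """Read the patch at each coordinate and encode the patches batch by batch."""
    features: List[List[float]] = []
    for batch in batched(coords, batch_size):
        patches = [read_patch(xy) for xy in batch]  # locate and read patch regions
        features.extend(encode_batch(patches))      # one feature vector per patch
    return features


# Toy stand-ins (assumptions): 300 coordinates on a single row, a "reader"
# that returns the coordinate itself, and an "encoder" that emits 2-d vectors.
coords = [(x, 0) for x in range(0, 300 * 256, 256)]
feats = extract_slide_features(
    coords,
    read_patch=lambda xy: xy,
    encode_batch=lambda patches: [[float(x), float(y)] for x, y in patches],
)
print(len(feats), "feature vectors, one per patch coordinate")
```

In the real pipeline, the coordinates come from the `h5` file, the patches come from the slide file, and the resulting features are written to a `pt` file.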

### 1.2 Extracting Patch Features at a Unified Magnification

Recall that in the previous step (`S03`) we patched all WSIs at a unified magnification by specifying `--patch_magnification 20` and `--patch_size 256`.

As a result, when the WSIs have **different highest magnifications** (`20x` or `40x`, as is common in `TCGA`), patches of different sizes (e.g., `256 x 256` or `512 x 512`) are read from the slides. We therefore need to specify `target_patch_size` as `256` to **resize patches** to the same size for feature extraction.

In effect, every patch is a `256 x 256` image at `20x`, and features are extracted from patches of the same size.
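The size arithmetic behind this can be illustrated with a minimal sketch. It is not CLAM's actual code: the function name `level0_patch_size` is hypothetical, and it assumes that patches are read at the slide's highest-magnification level and then resized to the target size.

```python
def level0_patch_size(highest_mag: float, target_mag: float = 20,
                      target_size: int = 256) -> int:
    """Side length (in highest-magnification pixels) that covers the same
    tissue area as a `target_size` patch at `target_mag`."""
    if highest_mag < target_mag:
        raise ValueError("the slide has no level at or above the target magnification")
    scale = highest_mag / target_mag
    return int(round(target_size * scale))


for mag in (20, 40):
    size = level0_patch_size(mag)
    print(f"{mag}x slide: read {size} x {size}, then resize to 256 x 256")
```

Under these assumptions, a `40x` slide is read in `512 x 512` regions, which matches the `patch_size 512` reported at `patch_level 0` in the example running log of this step.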

## 2. Running Feature Extraction

In this step, we have improved CLAM by adding more alternative architectures (including 6+ pretrained SOTA models) for patch feature extraction:
- `CONCH` / `UNI` are recommended; refer to their scripts in [tools/scripts](https://github.com/liupei101/Pipeline-Processing-TCGA-Slides-for-MIL/tree/main/tools/scripts).
- `CTransPath` and `PLIP` are good alternatives (both are free to use) when you cannot use `CONCH` due to limited access rights; their scripts are also provided in [tools/scripts](https://github.com/liupei101/Pipeline-Processing-TCGA-Slides-for-MIL/tree/main/tools/scripts).
- The `truncated ResNet50` and `ResNet18 w/ SimCLR` are **NOT** recommended.

### 2.1 Running Scripts

A detailed bash script (placed at `./tools/scripts/S04-Extracting-Feats.sh`), with `CONCH` as the patch feature extractor, is as follows:

```bash
#!/bin/bash
set -e

# Sample patches of SIZE x SIZE at MAG (as used in S03)
MAG=20
SIZE=256

# Path where CLAM is installed
DIR_REPO=../CLAM

# Root path to pathology images 
DIR_RAW_DATA=/NAS02/RawData/tcga_rcc
DIR_EXP_DATA=/NAS02/ExpData/tcga_rcc

# Sub-directory to the patch coordinates generated from S03
SUBDIR_READ=tiles-${MAG}x-s${SIZE}

# Arch to be used for patch feature extraction (CONCH is strongly recommended)
ARCH=CONCH

# Model path
# You need to first apply for its access rights via https://huggingface.co/MahmoodLab/CONCH
# and then download a model file named `pytorch_model.bin`.
MODEL_CKPT=/path/to/conch/pytorch_model.bin

# Sub-directory to the patch features 
SUBDIR_SAVE=${SUBDIR_READ}/feats-${ARCH}

cd ${DIR_REPO}

echo "running for extracting features from all tiles"
CUDA_VISIBLE_DEVICES=0 python3 extract_features_fp.py \
    --arch ${ARCH} \
    --ckpt_path ${MODEL_CKPT} \
    --data_h5_dir ${DIR_EXP_DATA}/${SUBDIR_READ} \
    --data_slide_dir ${DIR_RAW_DATA} \
    --csv_path ${DIR_EXP_DATA}/${SUBDIR_READ}/process_list_autogen.csv \
    --feat_dir ${DIR_EXP_DATA}/${SUBDIR_SAVE} \
    --target_patch_size ${SIZE} \
    --batch_size 128 \
    --slide_ext .svs \
    --slide_in_child_dir \
    --proj_to_contrast N

```

You can run this script with the following command:
```bash
nohup ./S04-Extracting-Feats.sh > S04-Extract-Feats.log 2>&1 &
```

The full running logs can be found in `./tools/scripts/S04-Extract-Feats.log`.

Next, we check if the number of generated files is consistent with that of patch files from the step `S03`.

```python
import os
import os.path as osp

DIR_FEAT = "/NAS02/ExpData/tcga_rcc/tiles-20x-s256/feats-CONCH/pt_files"
feat_files = [f for f in os.listdir(DIR_FEAT) if f.endswith(".pt")]
print("This step generated {} feature files in {}.".format(len(feat_files), DIR_FEAT))
```

```python
DIR_PATCH = "/NAS02/ExpData/tcga_rcc/tiles-20x-s256/patches"
patch_files = [f for f in os.listdir(DIR_PATCH) if f.endswith(".h5")]
print("The step S03 generated {} patch files in {}.".format(len(patch_files), DIR_PATCH))
```

```python
feat_filenames = [osp.splitext(f)[0] for f in feat_files]
patch_filenames = [osp.splitext(f)[0] for f in patch_files]
flag = False
for f in patch_filenames:
    if f not in feat_filenames:
        flag = True
        print("Expected {}, but it was not found in features files.".format(f))
if flag:
    print("Some slides were not processed.")
else:
    print("All slides in patch directory have been processed in this step.")
```
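The same check can also be written with sets, which reports mismatches in both directions. The following is a minimal sketch; the function name `compare_stems` and the temporary demo directories are hypothetical, and in practice you would pass the patch and feature directories used above.

```python
import os
import os.path as osp
import tempfile


def compare_stems(patch_dir, feat_dir):
    """Return (slides without features, features without patch files)."""
    patch_stems = {osp.splitext(f)[0] for f in os.listdir(patch_dir) if f.endswith(".h5")}
    feat_stems = {osp.splitext(f)[0] for f in os.listdir(feat_dir) if f.endswith(".pt")}
    missing = sorted(patch_stems - feat_stems)    # patched but not yet extracted
    orphaned = sorted(feat_stems - patch_stems)   # extracted but no patch file
    return missing, orphaned


# Demo on throwaway directories so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as root:
    patch_dir = osp.join(root, "patches")
    feat_dir = osp.join(root, "pt_files")
    os.makedirs(patch_dir)
    os.makedirs(feat_dir)
    for name in ("slide_a.h5", "slide_b.h5"):
        open(osp.join(patch_dir, name), "w").close()
    open(osp.join(feat_dir, "slide_a.pt"), "w").close()

    missing, orphaned = compare_stems(patch_dir, feat_dir)
    print("Slides without features:", missing)
    print("Features without patch files:", orphaned)
```

Set differences avoid scanning a list once per slide, and the second list helps spot stale feature files left over from an earlier run.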


### 2.2 Example of Running Logs

The running log of the first WSI is presented as follows:

```txt
progress: 0/940
TCGA-2K-A9WE-01Z-00-DX1.ED8ADE3B-D49B-403B-B4EB-BD11D91DD676
downsample [4.00005125 4.00008641]
downsampled_level_dim [39021 23146]
level_dim [39021 23146]
name TCGA-2K-A9WE-01Z-00-DX1.ED8ADE3B-D49B-403B-B4EB-BD11D91DD676
patch_level 0
patch_size 512
save_path /NAS02/ExpData/tcga_rcc/tiles-20x-s256/patches

feature extraction settings:
-- target patch size:  None
-- imagenet_pretrained:  False
-- patches sampler: None
-- color normalization: None
-- color argmentation: None
-- add_patch_noise: None
-- vertical_flip: False
-- transformations:  Compose(
    Resize(size=256, interpolation=bicubic, max_size=None, antialias=None)
    CenterCrop(size=(256, 256))
    <function _convert_to_rgb at 0x7f63c5177160>
    ToTensor()
    Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))
)
-- enable direct transform:  True
processing /NAS02/ExpData/tcga_rcc/tiles-20x-s256/patches/TCGA-2K-A9WE-01Z-00-DX1.ED8ADE3B-D49B-403B-B4EB-BD11D91DD676.h5: total of 57 batches
batch 0/57, 0 files processed
batch 20/57, 2560 files processed
batch 40/57, 5120 files processed
features size:  torch.Size([7274, 1024])
saved pt file:  /NAS02/ExpData/tcga_rcc/tiles-20x-s256/feats-CONCH/pt_files/TCGA-2K-A9WE-01Z-00-DX1.ED8ADE3B-D49B-403B-B4EB-BD11D91DD676.pt

computing features for /NAS02/ExpData/tcga_rcc/tiles-20x-s256/feats-CONCH/pt_files/TCGA-2K-A9WE-01Z-00-DX1.ED8ADE3B-D49B-403B-B4EB-BD11D91DD676.pt took 75.39572024345398 s
```
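The numbers in such a log can be cross-checked with a few lines of Python. The sketch below is only an illustration: the log excerpt is abbreviated (file paths shortened to `slide.h5` / `slide.pt`), the regular expressions are assumptions based on the lines shown above, and the batch size of 128 is the value passed via `--batch_size` in the script.

```python
import math
import re

# Abbreviated excerpt of the log above (paths shortened for readability).
log_text = (
    "processing slide.h5: total of 57 batches\n"
    "features size:  torch.Size([7274, 1024])\n"
    "computing features for slide.pt took 75.39572024345398 s\n"
)
BATCH_SIZE = 128  # as passed via --batch_size

n_batches = int(re.search(r"total of (\d+) batches", log_text).group(1))
n_patches, feat_dim = map(
    int, re.search(r"torch\.Size\(\[(\d+), (\d+)\]\)", log_text).groups()
)
seconds = float(re.search(r"took ([\d.]+) s", log_text).group(1))

# The number of batches should equal ceil(number of patches / batch size).
expected_batches = math.ceil(n_patches / BATCH_SIZE)
throughput = n_patches / seconds

print(f"patches: {n_patches}, feature dim: {feat_dim}")
print(f"batches: {n_batches} (expected {expected_batches})")
print(f"throughput: {throughput:.1f} patches/s")
```

Applied to the full log file, the same patterns could be used to flag slides whose batch count does not match their number of patches.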