# S02: Reorganizing Slides at Patient-level

After step S01, the downloaded files sit in directories named by their GDC file IDs, which carry no obvious meaning.

Here you reorganize the slide files at the patient level: all slide files from the same patient are stored in one directory, and that directory is named after the patient ID.

## 1. Loading gdc_sample_sheet.tsv

First, load the file `gdc_sample_sheet.XXXXXXXXX.tsv` downloaded from the TCGA website, as described in step S01.

In [1]:
import pandas as pd

# Use your own GDC sample sheet; the path below is only an example.
FILEPATH_TO_GDC_SAMPLE_SHEET = "./docs/gdc_sample_sheet.tsv"

df = pd.read_csv(FILEPATH_TO_GDC_SAMPLE_SHEET, sep='\t')
df.head()

| | File ID | File Name | Data Category | Data Type | Project ID | Case ID | Sample ID | Sample Type |
|---|---|---|---|---|---|---|---|---|
| 0 | 0596623c-c2c5-4de5-b358-d5393e79120e | TCGA-B3-4103-01Z-00-DX1.76bba2e9-0a6d-460b-8ae... | Biospecimen | Slide Image | TCGA-KIRP | TCGA-B3-4103 | TCGA-B3-4103-01Z | Primary Tumor |
| 1 | b1b3df18-1fcc-40a1-8610-143f06c9748b | TCGA-AL-3468-01Z-00-DX1.F86A4811-D60C-4845-A7A... | Biospecimen | Slide Image | TCGA-KIRP | TCGA-AL-3468 | TCGA-AL-3468-01Z | Primary Tumor |
| 2 | e55f0570-5c9e-4676-8b65-380ae02a8d63 | TCGA-A4-7997-01Z-00-DX1.aa4e2dd8-fac9-43ae-963... | Biospecimen | Slide Image | TCGA-KIRP | TCGA-A4-7997 | TCGA-A4-7997-01Z | Primary Tumor |
| 3 | 04ea6765-f97b-45a3-9c50-7882b2edf61a | TCGA-HE-A5NF-01Z-00-DX1.74ABE42F-E64E-4550-AD8... | Biospecimen | Slide Image | TCGA-KIRP | TCGA-HE-A5NF | TCGA-HE-A5NF-01Z | Primary Tumor |
| 4 | 212eed8a-ee10-4149-a5c8-7effb1d4747e | TCGA-EV-5903-01Z-00-DX1.04ef7cdf-b282-4ad3-917... | Biospecimen | Slide Image | TCGA-KIRP | TCGA-EV-5903 | TCGA-EV-5903-01Z | Primary Tumor |
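
Before going further, it can help to confirm that the sheet contains the columns used in the later steps. The following is a minimal sketch, not part of the original notebook: `REQUIRED_COLUMNS`, `check_sample_sheet`, and the two-row demo sheet are assumptions introduced for illustration only.

```python
import io
import pandas as pd

# Columns the later steps rely on (names taken from the sample sheet shown above).
REQUIRED_COLUMNS = ["File ID", "File Name", "Project ID", "Case ID"]

def check_sample_sheet(sheet):
    """Return (missing_columns, number_of_duplicated_file_names)."""
    missing = [c for c in REQUIRED_COLUMNS if c not in sheet.columns]
    n_dup = 0
    if "File Name" in sheet.columns:
        n_dup = int(sheet["File Name"].duplicated().sum())
    return missing, n_dup

# Tiny made-up sheet, used only to illustrate the check.
demo_tsv = (
    "File ID\tFile Name\tProject ID\tCase ID\n"
    "id-1\tTCGA-AA-0001-01Z-00-DX1.x.svs\tTCGA-KIRP\tTCGA-AA-0001\n"
    "id-2\tTCGA-AA-0002-01Z-00-DX1.y.svs\tTCGA-KIRC\tTCGA-AA-0002\n"
)
demo_sheet = pd.read_csv(io.StringIO(demo_tsv), sep="\t")
missing_cols, n_dup_names = check_sample_sheet(demo_sheet)
print(missing_cols, n_dup_names)
```

With a real sheet, you would pass `df` instead of `demo_sheet`; an empty list and a zero count suggest the sheet is usable as-is.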


## 2. Extracting data for possible classification tasks 

One possible task for the `TCGA-RCC` project is to identify the RCC subtype of each tissue slide. The three RCC subtypes correspond to the three sub-projects you have downloaded:
- `Clear Cell (CCRCC)` from `TCGA-KIRC`
- `Papillary (PRCC)` from `TCGA-KIRP`
- `Chromophobe (CHRCC)` from `TCGA-KICH`

Now you can extract the fields needed for the RCC subtyping task from the GDC sample sheet and save them on your server (below, they are saved as `TCGA_RCC_path_subtype.csv` in `/NAS02/ExpData/tcga_rcc/table`).

In [2]:
import os
import os.path as osp

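# Map each TCGA sub-project to its RCC subtype, and each subtype to an integer label.
PROJECT_TO_SUBTYPE = {"TCGA-KIRC": "CCRCC", "TCGA-KIRP": "PRCC", "TCGA-KICH": "CHRCC"}
SUBTYPE_TO_CODE = {"CCRCC": 0, "PRCC": 1, "CHRCC": 2}

def derive_subtype(prj, ret_format='type'):
    """Return the subtype name ('type') or its integer label ('code') for a project ID."""
    subtype = PROJECT_TO_SUBTYPE.get(prj)  # None for projects outside TCGA-RCC
    if ret_format == 'type':
        return subtype
    elif ret_format == 'code':
        # .get avoids a KeyError when the project is unknown (subtype is None)
        return SUBTYPE_TO_CODE.get(subtype)
    raise ValueError("ret_format must be 'type' or 'code'")

# The first 12 characters of a TCGA barcode identify the patient, e.g. TCGA-B3-4103.
df['patient_id'] = df['Case ID'].apply(lambda x: x.strip()[:12])
# The slide ID is the file name without its extension.
df['pathology_id'] = df['File Name'].apply(lambda x: os.path.splitext(x.strip())[0])
df['subtype'] = df['Project ID'].apply(lambda x: derive_subtype(x.strip(), ret_format='type'))
df['label'] = df['Project ID'].apply(lambda x: derive_subtype(x.strip(), ret_format='code'))

# Select the new columns by name rather than by position.
df_subtype = df[['patient_id', 'pathology_id', 'subtype', 'label']]
DIR_TABLE = "/NAS02/ExpData/tcga_rcc/table"
os.makedirs(DIR_TABLE, exist_ok=True)  # make sure the output directory exists
df_subtype.to_csv(osp.join(DIR_TABLE, "TCGA_RCC_path_subtype.csv"), index=False)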
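
If the sample sheet might contain projects other than the three RCC sub-projects, it may be worth flagging those rows before saving. The sketch below is an illustrative alternative using `Series.map`; the dictionary `PROJECT_TO_SUBTYPE` is defined here for the sketch, and the demo rows (including the placeholder project `TCGA-XXXX`) are made up.

```python
import pandas as pd

# Project-to-subtype mapping, following the list of sub-projects above.
PROJECT_TO_SUBTYPE = {"TCGA-KIRC": "CCRCC", "TCGA-KIRP": "PRCC", "TCGA-KICH": "CHRCC"}
SUBTYPE_TO_CODE = {"CCRCC": 0, "PRCC": 1, "CHRCC": 2}

# Made-up rows: the last project ID is a placeholder that is not part of TCGA-RCC.
demo = pd.DataFrame({"Project ID": ["TCGA-KIRP", "TCGA-KIRC ", "TCGA-KICH", "TCGA-XXXX"]})

# Series.map returns NaN for keys missing from the dictionary.
demo["subtype"] = demo["Project ID"].str.strip().map(PROJECT_TO_SUBTYPE)
demo["label"] = demo["subtype"].map(SUBTYPE_TO_CODE)

# Rows whose project could not be mapped to a subtype.
unknown = demo[demo["subtype"].isna()]
print(len(unknown), "row(s) with an unrecognized project")
```

Note that a column containing NaN is stored as floating point, so you would drop or fix the unknown rows before casting `label` to an integer type.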

In [3]:
# Check that each slide ID starts with its patient ID; any mismatch is printed.
for i in df_subtype.index:
    if df_subtype.loc[i, "patient_id"] != df_subtype.loc[i, "pathology_id"][:12]:
        print(i, df_subtype.loc[i, "patient_id"])
df_subtype.head()

| | patient_id | pathology_id | subtype | label |
|---|---|---|---|---|
| 0 | TCGA-B3-4103 | TCGA-B3-4103-01Z-00-DX1.76bba2e9-0a6d-460b-8ae... | PRCC | 1 |
| 1 | TCGA-AL-3468 | TCGA-AL-3468-01Z-00-DX1.F86A4811-D60C-4845-A7A... | PRCC | 1 |
| 2 | TCGA-A4-7997 | TCGA-A4-7997-01Z-00-DX1.aa4e2dd8-fac9-43ae-963... | PRCC | 1 |
| 3 | TCGA-HE-A5NF | TCGA-HE-A5NF-01Z-00-DX1.74ABE42F-E64E-4550-AD8... | PRCC | 1 |
| 4 | TCGA-EV-5903 | TCGA-EV-5903-01Z-00-DX1.04ef7cdf-b282-4ad3-917... | PRCC | 1 |
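
The row-by-row consistency check can also be expressed as a single vectorized comparison. This is a sketch on made-up rows (the second row is deliberately inconsistent), not the notebook's own code:

```python
import pandas as pd

# Made-up rows for illustration; the second slide ID does not match its patient ID.
demo = pd.DataFrame({
    "patient_id": ["TCGA-AA-0001", "TCGA-AA-0002"],
    "pathology_id": ["TCGA-AA-0001-01Z-00-DX1.x", "TCGA-AA-0009-01Z-00-DX1.y"],
})

# Compare the first 12 characters of every slide ID with the patient ID in one step.
mismatch = demo[demo["pathology_id"].str[:12] != demo["patient_id"]]
print(len(mismatch), "inconsistent row(s)")
```

An empty `mismatch` frame corresponds to the loop above printing nothing.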


## 3. Reorganizing Slides
Next, we reorganize all slides at the patient level according to the GDC sample sheet.

First, we check that the downloaded files are consistent with the GDC sample sheet.

In [4]:
DIR_DOWNLOAD = '/NAS02/RawData/download_rcc'
for i in df.index:
    filename = df.loc[i, "File Name"].strip()
    file_read_path = osp.join(DIR_DOWNLOAD, df.loc[i, "File ID"].strip(), filename)
    if not osp.exists(file_read_path):
        print("Not found {} in your downloaded files".format(filename))
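
If you would rather collect the missing files than only print them (for example, to download them again), the check can return a list. The sketch below assumes the same layout as above, `<download dir>/<File ID>/<File Name>`; the helper `find_missing` and the temporary demo directory are hypothetical.

```python
import os
import os.path as osp
import tempfile
import pandas as pd

def find_missing(sheet, dir_download):
    """Return the expected paths that do not exist under dir_download."""
    missing = []
    for file_id, file_name in zip(sheet["File ID"], sheet["File Name"]):
        path = osp.join(dir_download, file_id.strip(), file_name.strip())
        if not osp.isfile(path):
            missing.append(path)
    return missing

# Demo in a temporary directory: only the first of two listed files exists.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(osp.join(tmp, "id-1"))
    open(osp.join(tmp, "id-1", "a.svs"), "w").close()
    demo = pd.DataFrame({"File ID": ["id-1", "id-2"], "File Name": ["a.svs", "b.svs"]})
    missing = find_missing(demo, tmp)

n_missing = len(missing)
print(n_missing, "file(s) missing")
```

With the real data, you would call `find_missing(df, DIR_DOWNLOAD)` and proceed only when the returned list is empty.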

Now we move the files into their respective patient ID directories.

In [6]:
import shutil
from tqdm import tqdm

DIR_SAVE = '/NAS02/RawData/tcga_rcc'
if not osp.exists(DIR_SAVE):
    os.makedirs(DIR_SAVE)

for _, r in tqdm(df.iterrows()):
    filename = r["File Name"].strip()
    file_read_path = osp.join(DIR_DOWNLOAD, r["File ID"].strip(), filename)
    
    file_save_path = osp.join(DIR_SAVE, r["Case ID"].strip()[:12])
    if not osp.exists(file_save_path):
        os.makedirs(file_save_path)
    
    shutil.move(file_read_path, file_save_path)
    print("Finished moving {} to {}.".format(filename, file_save_path))

155it [00:00, 773.90it/s]

Finished moving TCGA-B3-4103-01Z-00-DX1.76bba2e9-0a6d-460b-8ae8-c38a1109456e.svs to /NAS02/RawData/tcga_rcc/TCGA-B3-4103.
Finished moving TCGA-AL-3468-01Z-00-DX1.F86A4811-D60C-4845-A7A5-B7C7BE202202.svs to /NAS02/RawData/tcga_rcc/TCGA-AL-3468.
Finished moving TCGA-A4-7997-01Z-00-DX1.aa4e2dd8-fac9-43ae-9634-4a3612d8c154.svs to /NAS02/RawData/tcga_rcc/TCGA-A4-7997.
Finished moving TCGA-HE-A5NF-01Z-00-DX1.74ABE42F-E64E-4550-AD8B-8B6DCEA84FE7.svs to /NAS02/RawData/tcga_rcc/TCGA-HE-A5NF.
Finished moving TCGA-EV-5903-01Z-00-DX1.04ef7cdf-b282-4ad3-917d-31bb7b379559.svs to /NAS02/RawData/tcga_rcc/TCGA-EV-5903.
Finished moving TCGA-BQ-5881-01Z-00-DX1.5eff6d78-773a-49d6-be59-de63172d543c.svs to /NAS02/RawData/tcga_rcc/TCGA-BQ-5881.
Finished moving TCGA-5P-A9KE-01Z-00-DX1.7295A9F6-0C5A-4DBC-9772-94CBB47D28A3.svs to /NAS02/RawData/tcga_rcc/TCGA-5P-A9KE.
Finished moving TCGA-BQ-5887-01Z-00-DX1.53c6100e-4ce4-4863-a987-ee0fe30f2e01.svs to /NAS02/RawData/tcga_rcc/TCGA-BQ-5887.
Finished moving TCGA-G7-

247it [00:00, 808.61it/s]

Finished moving TCGA-BP-5192-01Z-00-DX1.a907179e-7b54-4315-bed1-6f1f12609cc9.svs to /NAS02/RawData/tcga_rcc/TCGA-BP-5192.
Finished moving TCGA-B8-5164-01Z-00-DX1.6d846ab5-2fcf-4a99-8b82-3b77146fccf3.svs to /NAS02/RawData/tcga_rcc/TCGA-B8-5164.
Finished moving TCGA-CJ-4637-01Z-00-DX1.5632396E-A880-4686-9E94-89E3F01AFBFB.svs to /NAS02/RawData/tcga_rcc/TCGA-CJ-4637.
Finished moving TCGA-BP-5183-01Z-00-DX1.f84b4982-b51e-4220-a80b-767ecbb3e20f.svs to /NAS02/RawData/tcga_rcc/TCGA-BP-5183.
Finished moving TCGA-B0-4841-01Z-00-DX1.a33f52a7-d045-4aaa-acbe-e3122f33147c.svs to /NAS02/RawData/tcga_rcc/TCGA-B0-4841.
Finished moving TCGA-A3-3307-01Z-00-DX1.E9BB41BE-9F96-49A7-8164-A7E715A00EB9.svs to /NAS02/RawData/tcga_rcc/TCGA-A3-3307.
Finished moving TCGA-CJ-4870-01Z-00-DX1.38991588-D25A-48F6-80B5-51ED34AA8D35.svs to /NAS02/RawData/tcga_rcc/TCGA-CJ-4870.
Finished moving TCGA-A3-3363-01Z-00-DX1.6d933fa1-b68c-483f-819c-5a85f0389e6c.svs to /NAS02/RawData/tcga_rcc/TCGA-A3-3363.
Finished moving TCGA-B8-

404it [00:00, 662.01it/s]

Finished moving TCGA-SX-A7SL-01Z-00-DX1.A3C414EF-9AB0-43E0-A833-D1782B633F07.svs to /NAS02/RawData/tcga_rcc/TCGA-SX-A7SL.
Finished moving TCGA-P4-AAVL-01Z-00-DX1.6A86A318-057A-468D-826E-3B5C8B224266.svs to /NAS02/RawData/tcga_rcc/TCGA-P4-AAVL.
Finished moving TCGA-2Z-A9JI-01Z-00-DX1.41A4F992-87D9-43A4-8776-D11BF5EB1123.svs to /NAS02/RawData/tcga_rcc/TCGA-2Z-A9JI.
Finished moving TCGA-2Z-A9JN-01Z-00-DX1.714E8577-5BAE-4D99-92DF-1B75500D7AED.svs to /NAS02/RawData/tcga_rcc/TCGA-2Z-A9JN.
Finished moving TCGA-SX-A7SU-01Z-00-DX1.3EB1265B-F8CF-4703-88DB-7DC4E1D7E9C8.svs to /NAS02/RawData/tcga_rcc/TCGA-SX-A7SU.
Finished moving TCGA-F9-A7Q0-01Z-00-DX1.D7C7E174-CBB3-4489-A327-3C4000CA53B1.svs to /NAS02/RawData/tcga_rcc/TCGA-F9-A7Q0.
Finished moving TCGA-B1-7332-01Z-00-DX1.a04eb9f5-6b35-4877-8b54-f675fab63925.svs to /NAS02/RawData/tcga_rcc/TCGA-B1-7332.
Finished moving TCGA-B1-A47M-01Z-00-DX1.B049A5A8-406C-4208-BB5E-2CA42D063D16.svs to /NAS02/RawData/tcga_rcc/TCGA-B1-A47M.
Finished moving TCGA-B9-

552it [00:00, 694.69it/s]

Finished moving TCGA-B0-4842-01Z-00-DX1.d780158b-81c2-4ac9-b7a0-9c9386c6414c.svs to /NAS02/RawData/tcga_rcc/TCGA-B0-4842.
Finished moving TCGA-AK-3425-01Z-00-DX1.39FD616C-3050-41E1-BF80-D2069E492FDE.svs to /NAS02/RawData/tcga_rcc/TCGA-AK-3425.
Finished moving TCGA-A3-3319-01Z-00-DX1.38d515b4-6488-421a-9d61-d1e92d98ddbe.svs to /NAS02/RawData/tcga_rcc/TCGA-A3-3319.
Finished moving TCGA-B4-5835-01Z-00-DX1.55c88092-9092-4881-a8a1-88dafc0a58be.svs to /NAS02/RawData/tcga_rcc/TCGA-B4-5835.
Finished moving TCGA-B0-4706-01Z-00-DX1.a29630d2-be74-4245-8d47-2e1694bea5e4.svs to /NAS02/RawData/tcga_rcc/TCGA-B0-4706.
Finished moving TCGA-B0-4818-01Z-00-DX1.b5b74b2d-0d2f-40b3-81c9-e2b570da5918.svs to /NAS02/RawData/tcga_rcc/TCGA-B0-4818.
Finished moving TCGA-B0-4690-01Z-00-DX1.efecabc0-bb34-41e4-b631-0b2043826124.svs to /NAS02/RawData/tcga_rcc/TCGA-B0-4690.
Finished moving TCGA-A3-3367-01Z-00-DX1.9fca6a91-bb71-4632-8b20-e9f4d7030699.svs to /NAS02/RawData/tcga_rcc/TCGA-A3-3367.
Finished moving TCGA-BP-

802it [00:01, 861.44it/s]

Finished moving TCGA-BP-4961-01Z-00-DX1.4971f9c1-32b8-4eb2-b7a5-0985aaf83422.svs to /NAS02/RawData/tcga_rcc/TCGA-BP-4961.
Finished moving TCGA-T7-A92I-01Z-00-DX2.87A6A24E-42E3-4107-AE91-369C0D168556.svs to /NAS02/RawData/tcga_rcc/TCGA-T7-A92I.
Finished moving TCGA-AK-3447-01Z-00-DX1.3ed3ae76-21b2-46d0-bcb8-2d9ff54908ce.svs to /NAS02/RawData/tcga_rcc/TCGA-AK-3447.
Finished moving TCGA-T7-A92I-01Z-00-DX1.3B036C1D-F8A7-475F-9830-C0972AD3889F.svs to /NAS02/RawData/tcga_rcc/TCGA-T7-A92I.
Finished moving TCGA-B8-5162-01Z-00-DX1.3e999548-f3b9-46ca-b51a-d20735e1249b.svs to /NAS02/RawData/tcga_rcc/TCGA-B8-5162.
Finished moving TCGA-CZ-4858-01Z-00-DX1.657d1abf-8811-4ea5-a02b-89a6eeaeafbb.svs to /NAS02/RawData/tcga_rcc/TCGA-CZ-4858.
Finished moving TCGA-CJ-4643-01Z-00-DX1.50F38125-1825-4B66-A171-4C92279E306D.svs to /NAS02/RawData/tcga_rcc/TCGA-CJ-4643.
Finished moving TCGA-A3-3382-01Z-00-DX1.38cdb806-248a-4030-8a47-b5ce266991de.svs to /NAS02/RawData/tcga_rcc/TCGA-A3-3382.
Finished moving TCGA-BP-

940it [00:01, 794.70it/s]

Finished moving TCGA-KO-8405-01Z-00-DX1.DB2F9FFE-B562-4D52-A4DF-E939DE99B5F7.svs to /NAS02/RawData/tcga_rcc/TCGA-KO-8405.
Finished moving TCGA-KL-8329-01Z-00-DX1.6c9000ef-34ff-4e44-84b3-755a868f6a4e.svs to /NAS02/RawData/tcga_rcc/TCGA-KL-8329.
Finished moving TCGA-NP-A5H5-01Z-00-DX1.E6434A29-9182-4476-9E89-62B9C8EABC5B.svs to /NAS02/RawData/tcga_rcc/TCGA-NP-A5H5.
Finished moving TCGA-KO-8408-01Z-00-DX1.4D45F809-E93A-4221-832A-97C3A4B4E30C.svs to /NAS02/RawData/tcga_rcc/TCGA-KO-8408.
Finished moving TCGA-NP-A5GZ-01Z-00-DX1.07B19B85-A1CE-4AF2-B3E2-C86D11FAC907.svs to /NAS02/RawData/tcga_rcc/TCGA-NP-A5GZ.
Finished moving TCGA-UW-A7GP-01Z-00-DX1.C997CBA5-AF20-4CB3-9CB3-82AA65146EF5.svs to /NAS02/RawData/tcga_rcc/TCGA-UW-A7GP.
Finished moving TCGA-UW-A7GY-01Z-00-DX1.CD2CCA5D-C92B-409C-B5D6-1EB7C8A0B4CD.svs to /NAS02/RawData/tcga_rcc/TCGA-UW-A7GY.
Finished moving TCGA-UW-A72J-01Z-00-DX1.B62772BA-DC29-40C9-B5C8-AC47AFA0437D.svs to /NAS02/RawData/tcga_rcc/TCGA-UW-A72J.
Finished moving TCGA-KN-
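
Because `shutil.move` removes the file from the download directory, running the cell above a second time fails on files that were already moved. One way to make the step safe to repeat is sketched below; the helper `move_slide` and the demo paths are assumptions for illustration, not part of the original notebook.

```python
import os
import os.path as osp
import shutil
import tempfile

def move_slide(src, dst_dir):
    """Move src into dst_dir and return 'moved', 'skipped', or 'missing'."""
    dst = osp.join(dst_dir, osp.basename(src))
    if osp.exists(dst):
        return "skipped"   # already moved by an earlier run
    if not osp.exists(src):
        return "missing"   # never downloaded, or moved somewhere else
    os.makedirs(dst_dir, exist_ok=True)
    shutil.move(src, dst)
    return "moved"

# Demo in a temporary directory with one made-up slide file.
with tempfile.TemporaryDirectory() as tmp:
    src = osp.join(tmp, "download", "file-id-1", "TCGA-AA-0001-01Z-00-DX1.demo.svs")
    os.makedirs(osp.dirname(src))
    open(src, "w").close()
    dst_dir = osp.join(tmp, "tcga_rcc", "TCGA-AA-0001")

    first = move_slide(src, dst_dir)    # file is moved
    second = move_slide(src, dst_dir)   # same call again: nothing left to do
    third = move_slide(osp.join(tmp, "download", "nope.svs"), dst_dir)

print(first, second, third)  # prints: moved skipped missing
```

Counting the returned statuses also gives a short summary in place of one printed line per slide.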




Finally, we print the number of patients and slides.

In [8]:
cnt_pat, cnt_slide = 0, 0
for d in os.listdir(DIR_SAVE):
    if d.startswith('TCGA'):
        cnt_pat += 1
        cur_dir = osp.join(DIR_SAVE, d)
        cnt_slide += len([f for f in os.listdir(cur_dir) if f.endswith(".svs")])
print("# Patients:", cnt_pat)
print("# Slides:", cnt_slide)

# Patients: 898
# Slides: 940
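
These counts can be cross-checked against the sample sheet itself: the number of unique patient IDs and the number of unique file names should match what is on disk. The sketch below uses three file names taken from the log above (patient `TCGA-T7-A92I` has two slides); with the real data you would apply the same two expressions to `df`.

```python
import pandas as pd

# Three rows taken from the move log above: one patient with two slides, one with a single slide.
demo = pd.DataFrame({
    "Case ID": ["TCGA-T7-A92I", "TCGA-T7-A92I", "TCGA-B3-4103"],
    "File Name": [
        "TCGA-T7-A92I-01Z-00-DX2.87A6A24E-42E3-4107-AE91-369C0D168556.svs",
        "TCGA-T7-A92I-01Z-00-DX1.3B036C1D-F8A7-475F-9830-C0972AD3889F.svs",
        "TCGA-B3-4103-01Z-00-DX1.76bba2e9-0a6d-460b-8ae8-c38a1109456e.svs",
    ],
})

# Expected number of patient directories and of slide files.
expected_patients = demo["Case ID"].str.strip().str[:12].nunique()
expected_slides = demo["File Name"].str.strip().nunique()
print("# Patients:", expected_patients)
print("# Slides:", expected_slides)
```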
