# Description

According to the settings specified below, this notebook:
 1. reads the data from one source (GTEx, recount2, etc.) for the selected tissues (`TISSUES`) and gene selection method (`GENE_SELECTION_STRATEGY`),
 2. runs a quick performance test using the specified correlation coefficient (`CORRELATION_METHOD`), and
 3. computes the correlation matrix with that coefficient across the first `BENCHMARK_N_TOP_GENE` genes of each input file.

# Modules

In [1]:
import pandas as pd
from time import time
from tqdm import tqdm
from pathlib import Path

from ccc.utils import simplify_string
from ccc.corr import ccc_gpu

# Settings

In [2]:
GENE_SELECTION_STRATEGY = "var_pc_log2"
TOP_N_GENES = "all"

In [3]:
# top 5 tissues by sample size (see nbs/05_preprocessing/00-gtex_v8-split_by_tissue.ipynb);
# only "Whole Blood" is enabled for this run, the others are commented out
TISSUES = [
    # "Muscle - Skeletal",
    "Whole Blood",
    # "Skin - Sun Exposed (Lower leg)",
    # "Adipose - Subcutaneous",
    # "Artery - Tibial",
]

In [4]:
N_CPU_CORE = 24

In [5]:
CORRELATION_METHOD = lambda x: ccc_gpu(x, n_jobs=N_CPU_CORE)
CORRELATION_METHOD.__name__ = "ccc_gpu"

method_name = CORRELATION_METHOD.__name__
display(method_name)

'ccc_gpu'
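The `__name__` assignment above matters because a lambda has no meaningful name of its own, and the name is later embedded in the output file name. A minimal sketch of the same idea with a small factory function follows; `make_named_method` and `fake_corr` are hypothetical names used only for illustration and are not part of the `ccc` package.

```python
def make_named_method(func, name, **fixed_kwargs):
    """Wrap func with fixed keyword arguments and give the wrapper an explicit name."""

    def method(x):
        return func(x, **fixed_kwargs)

    method.__name__ = name
    return method


def fake_corr(x, n_jobs=1):
    # Hypothetical stand-in for a correlation function; it only echoes its inputs.
    return (x, n_jobs)


anonymous = lambda x: x
print(anonymous.__name__)  # <lambda>

named = make_named_method(fake_corr, "ccc_gpu", n_jobs=24)
print(named.__name__)  # ccc_gpu
```

Without the explicit name, any file name built from `__name__` would contain `<lambda>` instead of the method name.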

In [6]:
BENCHMARK_N_TOP_GENE = 5000

# Paths

In [7]:
DATA_DIR = Path("/mnt/data/proj_data/ccc-gpu/gene_expr/data/gtex_v8")
INPUT_DIR = DATA_DIR / "gene_selection" / "all"
display(INPUT_DIR)

assert INPUT_DIR.exists()

PosixPath('/mnt/data/proj_data/ccc-gpu/gene_expr/data/gtex_v8/gene_selection/all')

In [8]:
OUTPUT_DIR = DATA_DIR / "similarity_matrices" / TOP_N_GENES
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
display(OUTPUT_DIR)

PosixPath('/mnt/data/proj_data/ccc-gpu/gene_expr/data/gtex_v8/similarity_matrices/all')

# Data loading

In [9]:
tissue_in_file_names = [f"_data_{simplify_string(t.lower())}-" for t in TISSUES]

In [10]:
input_files = sorted(INPUT_DIR.glob(f"*-{GENE_SELECTION_STRATEGY}.pkl"))
input_files = [
    f for f in input_files if any(tn in f.name for tn in tissue_in_file_names)
]
display(len(input_files))

assert len(input_files) == len(TISSUES), len(TISSUES)
display(input_files)

1

[PosixPath('/mnt/data/proj_data/ccc-gpu/gene_expr/data/gtex_v8/gene_selection/all/gtex_v8_data_whole_blood-var_pc_log2.pkl')]
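The selection above keeps a file only when its name ends with the gene selection strategy and contains one of the tissue markers. The sketch below reproduces that logic on plain strings. `simplify_string_sketch` is a hypothetical stand-in for `ccc.utils.simplify_string` (whose exact behavior is not shown in this notebook), and every file name except the whole-blood one is made up for illustration.

```python
import re


def simplify_string_sketch(value: str) -> str:
    """Hypothetical stand-in: collapse runs of non-alphanumeric characters into '_'."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", value).strip("_")


def select_tissue_files(file_names, tissues, strategy):
    """Keep names that end with the strategy suffix and mention one of the tissues."""
    markers = [f"_data_{simplify_string_sketch(t.lower())}-" for t in tissues]
    suffix = f"-{strategy}.pkl"
    return sorted(
        name
        for name in file_names
        if name.endswith(suffix) and any(m in name for m in markers)
    )


candidate_names = [
    "gtex_v8_data_whole_blood-var_pc_log2.pkl",
    "gtex_v8_data_muscle_skeletal-var_pc_log2.pkl",  # made-up name
    "gtex_v8_data_whole_blood-other_strategy.pkl",  # made-up name
]
selected = select_tissue_files(candidate_names, ["Whole Blood"], "var_pc_log2")
print(selected)  # ['gtex_v8_data_whole_blood-var_pc_log2.pkl']
```

The trailing `-` in each marker keeps a tissue name from matching a longer tissue name that merely starts with it.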

# Compute similarity

## Performance test

In [11]:
display(input_files[0])
test_data = pd.read_pickle(input_files[0])

PosixPath('/mnt/data/proj_data/ccc-gpu/gene_expr/data/gtex_v8/gene_selection/all/gtex_v8_data_whole_blood-var_pc_log2.pkl')

In [12]:
test_data.shape

(56200, 755)

In [13]:
test_data.head()

gene_ens_id,GTEX-111YS-0006-SM-5NQBE,GTEX-1122O-0005-SM-5O99J,GTEX-1128S-0005-SM-5P9HI,GTEX-113IC-0006-SM-5NQ9C,GTEX-113JC-0006-SM-5O997,GTEX-117XS-0005-SM-5PNU6,GTEX-117YW-0005-SM-5NQ8Z,GTEX-1192W-0005-SM-5NQBQ,GTEX-1192X-0005-SM-5NQC3,GTEX-11DXW-0006-SM-5NQ7Y,...,GTEX-ZVE2-0006-SM-51MRW,GTEX-ZVP2-0005-SM-51MRK,GTEX-ZVT2-0005-SM-57WBW,GTEX-ZVT3-0006-SM-51MT9,GTEX-ZVT4-0006-SM-57WB8,GTEX-ZVTK-0006-SM-57WBK,GTEX-ZVZP-0006-SM-51MSW,GTEX-ZVZQ-0006-SM-51MR8,GTEX-ZXES-0005-SM-57WCB,GTEX-ZXG5-0005-SM-57WCN
ENSG00000169429.10,0.5623,0.8067,116.9,4.047,211.0,58.11,68.38,249.5,5.095,295.9,...,39.96,0.1393,0.2238,245.0,513.6,1626.0,0.5633,515.7,1.194,1163.0
ENSG00000135245.9,0.6529,1.385,199.2,2.266,116.7,192.3,161.5,263.5,23.54,251.9,...,114.3,1.833,0.4115,149.0,935.3,233.6,0.8882,134.0,1.12,295.7
ENSG00000163631.16,1.848,0.2503,0.08429,1.251,1348.0,9.971,101.3,95.09,1.264,119.3,...,2.092,2.11,0.03588,171.8,107.1,71.25,1.772,309.6,0.07361,17.75
ENSG00000277632.1,1.696,1.345,235.1,11.77,141.7,199.1,525.5,659.9,10.91,209.3,...,61.34,2.25,0.7231,261.2,400.0,288.5,2.696,287.5,3.323,618.9
ENSG00000239839.6,185.2,1.779,694.3,23.84,297.3,3122.0,2521.0,1504.0,80.06,652.0,...,1010.0,253.8,94.52,6083.0,2768.0,52.06,34.57,17.36,352.3,63.85


This is a quick performance test of the correlation measure. If the correlation method is optimized with `numba`, it has to be compiled by a warm-up call (a setup line such as `_tmp = ...`) before the test is timed. This notebook contains no such line, so the elapsed time reported in the run below includes any one-time initialization cost of the first call.

## Run

In [14]:
pbar = tqdm(input_files, ncols=100)

for tissue_data_file in pbar:
    pbar.set_description(tissue_data_file.stem)

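    # read the expression data (rows are genes, columns are samples)
    data = pd.read_pickle(tissue_data_file)
    # keep only the first BENCHMARK_N_TOP_GENE genes (rows) for this run
    data = data.iloc[:BENCHMARK_N_TOP_GENE]

    # compute correlations and time the call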
    start_time = time()

    data_corrs = CORRELATION_METHOD(data)

    end_time = time()
    elapsed_time = end_time - start_time
    display(elapsed_time)

    # save
    output_filename = f"{tissue_data_file.stem}-{method_name}-{TOP_N_GENES}-all.pkl"
    data_corrs.to_pickle(path=OUTPUT_DIR / output_filename)


  0%|                                                                         | 0/1 [00:00<?, ?it/s]


gtex_v8_data_whole_blood-var_pc_log2:   0%|                                   | 0/1 [00:00<?, ?it/s]

Device 0: "NVIDIA GeForce RTX 4090"
  CUDA Driver Version / Runtime Version          12.7 / 12.5
  CUDA Capability Major/Minor version number:    8.9
  Total amount of global memory:                 23.52 GBytes (25251414016 bytes)
  GPU Clock rate:                                2535 MHz (2.54 GHz)
  Memory Clock rate:                             10501 Mhz
  Memory Bus Width:                              384-bit
Free memory: 22.915894 GB, Total memory: 23.517212 GB

Debug Info:
  n_features: 5000
  n_partitions: 9
  n_objects: 755
  n_feature_comp: 12497500
  n_aris: 1012297500
  batch_n_aris: 1012297500

Allocating host memory...
  Memory before allocation: Host Memory Usage: 1466 MB (RSS), 15305 MB (Virtual)
  Memory after allocation: Host Memory Usage: 1513 MB (RSS), 15352 MB (Virtual)
  Memory used: 47 MB

Allocating device memory...
  max_batch_feature_comp: 12497500
  Memory before allocation: Free memory: 22.915894 GB, Total memory: 23.517212 GB
  Memory after allocation: Free 

16.901498317718506


gtex_v8_data_whole_blood-var_pc_log2: 100%|███████████████████████████| 1/1 [00:17<00:00, 17.61s/it]
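The counts in the debug output above are consistent with one simple reading, which is inferred from the numbers and not taken from the library's source: `n_feature_comp` matches the number of unordered gene pairs, and `n_aris` matches that count multiplied by `n_partitions` squared. The arithmetic can be checked as follows.

```python
from math import comb

n_features = 5000  # genes kept by BENCHMARK_N_TOP_GENE
n_partitions = 9  # value reported in the debug output

# Number of unordered pairs of genes.
n_feature_comp = comb(n_features, 2)
# One value per pair of partitions for each pair of genes (assumed interpretation).
n_aris = n_feature_comp * n_partitions**2

print(n_feature_comp)  # 12497500
print(n_aris)  # 1012297500
```

Both values agree with the `n_feature_comp` and `n_aris` lines in the log.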




