# Description

Using the settings specified below, this notebook:
 1. reads the data from one source (GTEx, recount2, etc.) for the chosen gene selection method (`GENE_SELECTION_STRATEGY`),
 2. runs a quick performance test of the specified correlation coefficient (`CORRELATION_METHOD`), and
 3. computes the correlation matrix across the selected genes using that coefficient.

# Modules

In [1]:
import pandas as pd
from time import time
from tqdm import tqdm
from pathlib import Path

from ccc.utils import simplify_string
from ccc.corr import ccc_gpu

# Settings

In [2]:
GENE_SELECTION_STRATEGY = "var_pc_log2"
TOP_N_GENES = "top_5k"

In [3]:
# the top 5 tissues by sample size (see nbs/05_preprocessing/00-gtex_v8-split_by_tissue.ipynb);
# only the uncommented entries are processed
TISSUES = [
    # "Muscle - Skeletal",
    "Whole Blood",
    # "Skin - Sun Exposed (Lower leg)",
    # "Adipose - Subcutaneous",
    # "Artery - Tibial",
]

In [4]:
N_CPU_CORE = 24

In [5]:
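def CORRELATION_METHOD(x):
    # wrap ccc_gpu so the number of CPU cores is fixed for every call
    return ccc_gpu(x, n_jobs=N_CPU_CORE)


# name used below to build the output file names
CORRELATION_METHOD.__name__ = "ccc_gpu"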

method_name = CORRELATION_METHOD.__name__
display(method_name)

'ccc_gpu'

In [6]:
BENCHMARK_N_TOP_GENE = 5000

# Paths

In [7]:
DATA_DIR = Path("/mnt/data/proj_data/ccc-gpu/gene_expr/data/gtex_v8")
INPUT_DIR = DATA_DIR / "gene_selection" / "all"
display(INPUT_DIR)

assert INPUT_DIR.exists()

PosixPath('/mnt/data/proj_data/ccc-gpu/gene_expr/data/gtex_v8/gene_selection/all')

In [8]:
OUTPUT_DIR = DATA_DIR / "similarity_matrices" / TOP_N_GENES
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
display(OUTPUT_DIR)

PosixPath('/mnt/data/proj_data/ccc-gpu/gene_expr/data/gtex_v8/similarity_matrices/top_5k')

# Data loading

In [9]:
tissue_in_file_names = [f"_data_{simplify_string(t.lower())}-" for t in TISSUES]

In [10]:
input_files = sorted(INPUT_DIR.glob(f"*-{GENE_SELECTION_STRATEGY}.pkl"))
input_files = [
    f for f in input_files if any(tn in f.name for tn in tissue_in_file_names)
]
display(len(input_files))

assert len(input_files) == len(TISSUES), len(TISSUES)
display(input_files)

1

[PosixPath('/mnt/data/proj_data/ccc-gpu/gene_expr/data/gtex_v8/gene_selection/all/gtex_v8_data_whole_blood-var_pc_log2.pkl')]
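The file selection above can be illustrated without access to the GTEx files. This is only a sketch: `to_token` is a hypothetical stand-in for `simplify_string` (its behavior is assumed here, inferred from the `whole_blood` file name in the output), and every file name except the whole-blood one is made up for the example.

```python
import re


def to_token(name: str) -> str:
    """Hypothetical stand-in for ccc.utils.simplify_string (assumed behavior)."""
    # Lowercase, then collapse each run of non-alphanumeric characters into "_".
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


tissues = ["Whole Blood"]
strategy = "var_pc_log2"

# Illustrative file names; only the first mirrors the output shown above.
file_names = [
    "gtex_v8_data_whole_blood-var_pc_log2.pkl",
    "gtex_v8_data_muscle_skeletal-var_pc_log2.pkl",
    "gtex_v8_data_whole_blood-other_strategy.pkl",
]

# Same token pattern as in the notebook: "_data_<tissue>-".
tokens = [f"_data_{to_token(t)}-" for t in tissues]

# Keep files for the chosen strategy that also match one of the tissue tokens.
selected = sorted(
    f
    for f in file_names
    if f.endswith(f"-{strategy}.pkl") and any(tok in f for tok in tokens)
)

# One file per requested tissue is expected, as in the notebook's assertion.
assert len(selected) == len(tissues)
print(selected)
```

Wrapping the tissue name between `_data_` and `-` keeps a short tissue name from matching inside a longer one.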

# Compute similarity

## Performance test

In [12]:
display(input_files[0])
test_data = pd.read_pickle(input_files[0])

PosixPath('/mnt/data/proj_data/ccc-gpu/gene_expr/data/gtex_v8/gene_selection/all/gtex_v8_data_whole_blood-var_pc_log2.pkl')

In [13]:
test_data.shape

(56200, 755)

This is a quick performance test of the correlation measure. A setup call (`_tmp = ...`) must run before the timed test: if the correlation method was optimized using `numba`, it has to be compiled first so that compilation time is not included in the measurement.

## Run

In [15]:
pbar = tqdm(input_files, ncols=100)

for tissue_data_file in pbar:
    pbar.set_description(tissue_data_file.stem)

    # read the data and keep the first BENCHMARK_N_TOP_GENE genes (rows)
    data = pd.read_pickle(tissue_data_file)
    data = data.iloc[:BENCHMARK_N_TOP_GENE]

    # compute correlations and time the call
    start_time = time()

    data_corrs = CORRELATION_METHOD(data)

    end_time = time()
    elapsed_time = end_time - start_time
    display(elapsed_time)

    # save
    output_filename = f"{tissue_data_file.stem}-{method_name}.pkl"
    data_corrs.to_pickle(path=OUTPUT_DIR / output_filename)

gtex_v8_data_whole_blood-var_pc_log2:   0%|                                   | 0/1 [00:00<?, ?it/s]

11.180723428726196

gtex_v8_data_whole_blood-var_pc_log2: 100%|███████████████████████████| 1/1 [00:11<00:00, 11.34s/it]
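To put the elapsed time in context, it can be converted into a rough throughput figure. The sketch below assumes that each unordered gene pair is computed exactly once, which may not reflect how `ccc_gpu` schedules its work internally; the variable names are illustrative.

```python
n_genes = 5000  # BENCHMARK_N_TOP_GENE
elapsed_time = 11.180723428726196  # seconds, taken from the run above

# Number of unordered gene pairs (the diagonal is excluded).
n_pairs = n_genes * (n_genes - 1) // 2

# Assumes every pair is computed once during the timed call.
pairs_per_second = n_pairs / elapsed_time

print(f"{n_pairs:,} pairs, {pairs_per_second:,.0f} pairs/s")
```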
