# Description

Using the settings specified below, this notebook:
 1. reads all the data files from one source (GTEx, recount2, etc.) that match the gene selection strategy (`GENE_SELECTION_STRATEGY`),
 2. runs a quick performance test of the chosen correlation coefficient (`CORRELATION_METHOD`) on the first file, and
 3. computes, for each input file (one per tissue), the correlation matrix across all the genes using that coefficient, and saves it to disk.
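
As a reminder of what a gene-by-gene Spearman matrix contains, here is a minimal pure-Python sketch: rank each gene's values across samples (ties get the average rank), then take the Pearson correlation of the ranks. This is an illustration only and is not the implementation in `clustermatch.corr`; the function names and the toy data are hypothetical, and genes are assumed to be rows.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_matrix(rows):
    """Spearman correlation between every pair of rows (genes)."""
    ranked = [average_ranks(row) for row in rows]
    return [[pearson(a, b) for b in ranked] for a in ranked]


# Hypothetical toy data: 3 genes (rows) measured in 4 samples (columns).
toy_expression = [
    [1.0, 2.0, 3.0, 4.0],      # gene A
    [10.0, 20.0, 25.0, 90.0],  # gene B: same ordering as gene A
    [4.0, 3.0, 2.0, 1.0],      # gene C: reverse ordering of gene A
]
corr = spearman_matrix(toy_expression)
```

The result is a symmetric genes-by-genes matrix with ones on the diagonal. A gene with constant expression would make the denominator zero, which this sketch does not handle.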

# Modules

In [None]:
import pandas as pd
from tqdm import tqdm

from clustermatch import conf
from clustermatch.corr import spearman

# Settings

In [None]:
GENE_SELECTION_STRATEGY = "var_raw"

In [None]:
CORRELATION_METHOD = spearman

method_name = CORRELATION_METHOD.__name__
display(method_name)

In [None]:
PERFORMANCE_TEST_N_TOP_GENES = 500

# Paths

In [None]:
INPUT_DIR = conf.GTEX["GENE_SELECTION_DIR"]
display(INPUT_DIR)

assert INPUT_DIR.exists()

In [None]:
OUTPUT_DIR = conf.GTEX["SIMILARITY_MATRICES_DIR"]
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
display(OUTPUT_DIR)

# Data loading

In [None]:
input_files = sorted(INPUT_DIR.glob(f"*-{GENE_SELECTION_STRATEGY}.pkl"))
display(len(input_files))

assert len(input_files) == conf.GTEX["N_TISSUES"], len(input_files)
display(input_files[:5])

# Compute similarity

## Performance test

In [None]:
display(input_files[0])
test_data = pd.read_pickle(input_files[0])

In [None]:
test_data.shape

In [None]:
test_data.head()

This is a quick performance test of the correlation measure. The following cell (`_tmp = ...`) is setup code: if the correlation method was optimized using `numba`, it is compiled on its first call, so it needs to be called once before timing to keep compilation out of the measurement.

In [None]:
_tmp = CORRELATION_METHOD(test_data.iloc[:3])

display(_tmp.shape)
display(_tmp)

In [None]:
%timeit CORRELATION_METHOD(test_data.iloc[:PERFORMANCE_TEST_N_TOP_GENES])

## Run

In [None]:
pbar = tqdm(input_files, ncols=100)

for tissue_data_file in pbar:
    pbar.set_description(tissue_data_file.stem)

    # read
    data = pd.read_pickle(tissue_data_file)

    # compute correlations
    data_corrs = CORRELATION_METHOD(data)

    # save
    output_filename = f"{tissue_data_file.stem}-{method_name}.pkl"
    data_corrs.to_pickle(path=OUTPUT_DIR / output_filename)
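
The output name is the input file's stem followed by the method name, so each result can be traced back to its input file and coefficient. A minimal sketch of that naming, assuming a hypothetical input file name (the real names depend on the files in `INPUT_DIR`):

```python
from pathlib import Path

# Hypothetical input file name matching the "*-{GENE_SELECTION_STRATEGY}.pkl" pattern.
tissue_data_file = Path("example_tissue-var_raw.pkl")
method_name = "spearman"

# Same expression as in the loop above: stem + "-" + method name + ".pkl".
output_filename = f"{tissue_data_file.stem}-{method_name}.pkl"
print(output_filename)

# The method name can be recovered from the last "-"-separated field of the stem.
input_stem, recovered_method = Path(output_filename).stem.rsplit("-", 1)
```

Recovering the method name this way assumes it contains no `-` character, which holds for `spearman`.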