# Description

According to the settings specified below, this notebook:
 1. reads all the data from one source (GTEx, recount2, etc.) according to the gene selection method (`GENE_SELECTION_STRATEGY`), if one is set,
 2. runs a quick performance test using the correlation coefficient specified (`CORRELATION_METHOD`), and
 3. computes the correlation matrix across all the genes using the correlation coefficient specified.

# Modules

In [None]:
import pandas as pd

from clustermatch import conf
from clustermatch.corr import clustermatch

# Settings

In [None]:
# recount2 has no gene subsets, so no gene selection strategy is set here
# GENE_SELECTION_STRATEGY = "var_raw"

In [None]:
def clustermatch_k2to5(data):
    # recount2 has ~37k samples per gene, so for clustermatch we reduce the
    # number of internal clusters generated: 2 to 5 instead of the default
    # 2 to 10
    n_clusters = list(range(2, 5 + 1))
    return clustermatch(data, internal_n_clusters=n_clusters)


CORRELATION_METHOD = clustermatch_k2to5

method_name = CORRELATION_METHOD.__name__
display(method_name)
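Because `method_name` is read from `CORRELATION_METHOD.__name__`, the configured method has to be a regular named function. The sketch below illustrates this with a hypothetical stand-in, `toy_similarity` (not part of `clustermatch`), which accepts an `internal_n_clusters` keyword and simply echoes it back. It compares a named wrapper with a `functools.partial` object, which does not get a `__name__` attribute automatically.

```python
from functools import partial


def toy_similarity(data, internal_n_clusters=None):
    # Hypothetical stand-in for a similarity function (NOT clustermatch):
    # it only returns the cluster settings it would use.
    if internal_n_clusters is None:
        internal_n_clusters = range(2, 10 + 1)
    return list(internal_n_clusters)


def toy_similarity_k2to5(data):
    # Named wrapper that fixes the keyword argument, as in the cell above.
    return toy_similarity(data, internal_n_clusters=list(range(2, 5 + 1)))


# Same behavior, but built with functools.partial.
as_partial = partial(toy_similarity, internal_n_clusters=list(range(2, 5 + 1)))

# The named wrapper exposes __name__; the partial object does not.
wrapper_name = toy_similarity_k2to5.__name__
partial_has_name = hasattr(as_partial, "__name__")

print(wrapper_name)
print(partial_has_name)
```

Both callables return the same settings; only the named wrapper can be used directly to build a file name from `__name__`.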

In [None]:
PERFORMANCE_TEST_N_TOP_GENES = 500

# Paths

In [None]:
INPUT_FILE = conf.RECOUNT2["DATA_FILE"]
display(INPUT_FILE)

assert INPUT_FILE.exists()

In [None]:
OUTPUT_DIR = conf.RECOUNT2["SIMILARITY_MATRICES_DIR"]
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
display(OUTPUT_DIR)

# Data loading

In [None]:
data = pd.read_pickle(INPUT_FILE)

In [None]:
data.shape

In [None]:
data.head()

# Compute similarity

## Performance test

In [None]:
# randomly select a subset of the genes (fixed seed for reproducibility)
test_data = data.sample(n=PERFORMANCE_TEST_N_TOP_GENES, random_state=0)

In [None]:
test_data.shape

In [None]:
test_data.head()

This is a quick performance test of the correlation measure. The following line (`_tmp = ...`) is setup code: if the correlation method was optimized with `numba`, it needs to be compiled once before the test so that compilation time is not included in the timing.

In [None]:
_tmp = CORRELATION_METHOD(test_data.iloc[:3])

display(_tmp.shape)
display(_tmp)

In [None]:
%timeit CORRELATION_METHOD(test_data)

## Run

In [None]:
# compute correlations
data_corrs = CORRELATION_METHOD(data)

In [None]:
display(data_corrs.shape)

assert data.shape[0] == data_corrs.shape[0]

In [None]:
data_corrs.head()
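The row-count assertion above can be extended with a few more structural checks. This is a sketch under the assumption that the correlation function returns a square, symmetric `DataFrame` labeled by gene on both axes; `check_similarity_matrix` is a hypothetical helper, and the toy data and `np.corrcoef` are stand-ins for the real input and method.

```python
import numpy as np
import pandas as pd


def check_similarity_matrix(data, corrs):
    # Hypothetical helper: structural checks for a gene-by-gene matrix.
    n_genes = data.shape[0]
    # One row and one column per gene.
    assert corrs.shape == (n_genes, n_genes)
    # Rows and columns are labeled by the genes of the input, in order.
    assert corrs.index.equals(data.index)
    assert corrs.columns.equals(data.index)
    # Symmetric within floating-point tolerance.
    values = corrs.to_numpy()
    assert np.allclose(values, values.T, equal_nan=True)
    return True


# Toy input: 4 "genes" (rows) by 30 "samples" (columns).
rng = np.random.default_rng(0)
toy_data = pd.DataFrame(
    rng.random((4, 30)),
    index=["gene_a", "gene_b", "gene_c", "gene_d"],
)

# Stand-in similarity: Pearson correlation between rows.
toy_corrs = pd.DataFrame(
    np.corrcoef(toy_data.to_numpy()),
    index=toy_data.index,
    columns=toy_data.index,
)

result = check_similarity_matrix(toy_data, toy_corrs)
print(result)
```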

In [None]:
output_filename = OUTPUT_DIR / f"{INPUT_FILE.stem}-{method_name}.pkl"
display(output_filename)

In [None]:
# save
data_corrs.to_pickle(output_filename)