## GTEx MatrixTables

This notebook creates MatrixTables containing all variant-gene associations tested in each tissue (including non-significant associations) for [GTEx](https://gtexportal.org/home/datasets) v8.

There are two MatrixTables: one for the eQTL tissue-specific all SNP gene associations data, and one for the sQTL tissue-specific all SNP gene associations data.

Hail Tables for each tissue were previously created from the data [here](https://console.cloud.google.com/storage/browser/hail-datasets-tmp/GTEx/GTEx_Analysis_v8_QTLs). Each eQTL table is ~7 GiB, and each sQTL table is ~40 GiB. A README describing the fields in the GTEx QTL datasets is available [here](https://storage.googleapis.com/gtex_analysis_v8/single_tissue_qtl_data/README_eQTL_v8.txt).

Each MatrixTable has rows keyed by `["locus", "alleles"]`, and columns keyed by `["tissue"]`.

In [None]:
import subprocess
import hail as hl
hl.init()

First we can grab a list of the GTEx tissue names:

In [None]:
list_tissues = subprocess.run(["gsutil", "-u", "broad-ctsa", "ls", 
                                "gs://hail-datasets-tmp/GTEx/GTEx_Analysis_v8_QTLs/GTEx_Analysis_v8_eQTL_all_associations"], 
                               stdout=subprocess.PIPE)
tissue_files = list_tissues.stdout.decode("utf-8").split()
tissue_names = [x.split("/")[-1].split(".")[0] for x in tissue_files]
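
The name extraction above depends on each listed path ending in a file name whose first dot-separated piece is the tissue name. A minimal offline sketch of the same parsing follows. The helper `tissue_names_from_listing` and the sample paths are hypothetical and only illustrate the string handling; the real object names in the bucket may differ.

```python
def tissue_names_from_listing(listing: str) -> list:
    """Extract tissue names from whitespace-separated object paths."""
    names = []
    for path in listing.split():
        # Take the last path component, ignoring any trailing slash
        basename = path.rstrip("/").split("/")[-1]
        if not basename:
            continue
        # Everything before the first "." is treated as the tissue name
        names.append(basename.split(".")[0])
    return names


# Hypothetical listing output, used only for illustration
sample_listing = (
    "gs://example-bucket/eQTL_all_associations/Adipose_Subcutaneous.allpairs.txt.gz\n"
    "gs://example-bucket/eQTL_all_associations/Whole_Blood.allpairs.txt.gz\n"
)
sample_names = tissue_names_from_listing(sample_listing)
print(sample_names)
```

Running the parsing on a small sample like this is a cheap way to confirm the naming assumption before looping over all tissues.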

Take a peek at the tissue names we get to make sure they're what we expect:

In [None]:
tissue_names[0:5]

We start with the eQTL tables since they are smaller and easier to work with. There are three steps:
  - Generate individual MatrixTables from the existing Hail Tables for each tissue type (49 tissue types in total).
  - Perform a multi-way union cols (MWUC) on these 49 MatrixTables to create a single MatrixTable with one column per tissue.
  - Repartition the unioned MatrixTable, because after the MWUC its partitions are very imbalanced (some are KiBs, others are GiBs).

### eQTL tissue-specific all SNP gene associations

#### Generate individual MatrixTables from the existing Hail Tables for each tissue type (49 total).

Write output to `gs://hail-datasets-tmp/GTEx/eQTL_MatrixTables/`.

In [None]:
for tissue_name in tissue_names:
    print(f"eQTL: {tissue_name}")
    ht = hl.read_table(f"gs://hail-datasets-us/GTEx_eQTL_allpairs_{tissue_name}_v8_GRCh38.ht", _n_partitions=64)

    ht = ht.annotate(_gene_id = ht.gene_id, _tss_distance = ht.tss_distance)
    ht = ht.drop("variant_id", "metadata")
    ht = ht.key_by("locus", "alleles", "_gene_id", "_tss_distance")
    ht = ht.annotate(**{tissue_name: ht.row_value.drop("gene_id", "tss_distance")})
    ht = ht.select(tissue_name)

    mt = ht.to_matrix_table_row_major(columns=[tissue_name], col_field_name="tissue")
    mt = mt.checkpoint(
        f"gs://hail-datasets-tmp/GTEx/eQTL_MatrixTables/GTEx_eQTL_all_snp_gene_associations_{tissue_name}_v8_GRCh38.mt", 
        overwrite=False,
        _read_if_exists=True
    )

To ensure that everything is joined correctly later on, we add both the `_gene_id` and `_tss_distance` fields to the table keys here.

After the unioned MatrixTable is created we will re-key the rows to just be `["locus", "alleles"]`, and rename the fields above back to `gene_id` and `tss_distance` (they will now be row fields).
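
The following plain-Python sketch illustrates one likely reason the extra key fields matter: a single variant can be tested against several genes, so `["locus", "alleles"]` alone does not identify an association. The rows, gene names, and the helper `index_by` are hypothetical and do not use Hail; they only show the key-collision idea.

```python
# Hypothetical rows: one variant tested against two genes in a single tissue
rows = [
    {"locus": "chr1:13550", "alleles": ("G", "A"),
     "gene_id": "GENE_A", "tss_distance": 1681, "slope": 0.12},
    {"locus": "chr1:13550", "alleles": ("G", "A"),
     "gene_id": "GENE_B", "tss_distance": -859, "slope": -0.30},
]


def index_by(rows, key_fields):
    """Group rows by the values of the given key fields."""
    index = {}
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        index.setdefault(key, []).append(row)
    return index


short_key = index_by(rows, ["locus", "alleles"])
full_key = index_by(rows, ["locus", "alleles", "gene_id", "tss_distance"])

# Under the short key both associations share one key; under the full key
# every association has its own key.
print(len(short_key), len(full_key))
```

With the short key, a join across tissues would have no way to pair the `GENE_A` row in one tissue with the `GENE_A` row in another.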

#### Perform multi-way union cols (MWUC) on MatrixTables generated above

The function below takes a list of MatrixTables and a list of column key fields, and outputs a single MatrixTable with the columns unioned.

In [None]:
from typing import List
def multi_way_union_cols(mts: List[hl.MatrixTable], column_keys: List[str]) -> hl.MatrixTable:
    missing_struct = "struct{ma_samples: int32, ma_count: int32, maf: float64, pval_nominal: float64, slope: float64, slope_se: float64}"
    
    mts = [mt._localize_entries("_mt_entries", "_mt_cols") for mt in mts]
    
    joined = hl.Table.multi_way_zip_join(mts, "_t_entries", "_t_cols")
    joined = joined.annotate(_t_entries_missing = joined._t_entries.map(lambda x: hl.is_missing(x)))
    
    rows = [(r, joined._t_entries.map(lambda x: x[r])[0])
            for r in joined._t_entries.dtype.element_type.fields
            if r != "_mt_entries"]
    # Provide a dummy array<struct> for tissues that are not present, so that missing
    # elements are not dropped from the flattened array. Otherwise we get a
    # HailException: length mismatch between entry array and column array in
    # 'to_matrix_table_row_major'.
    entries = [("_t_entries_flatten", 
                hl.flatten(
                    joined._t_entries.map(
                        lambda x: hl.if_else(
                            hl.is_defined(x), 
                            x._mt_entries,
                            hl.array([
                                hl.struct(
                                    ma_samples = hl.missing(hl.tint32), 
                                    ma_count = hl.missing(hl.tint32), 
                                    maf = hl.missing(hl.tfloat64), 
                                    pval_nominal = hl.missing(hl.tfloat64), 
                                    slope = hl.missing(hl.tfloat64), 
                                    slope_se = hl.missing(hl.tfloat64)
                                )
                            ])
                        )
                    )
                )
               )]
    joined = joined.annotate(**dict(rows + entries))
    # If an entry is missing, replace it with a missing struct of the same form
    # at the same index in the array.
    joined = joined.annotate(_t_entries_new = hl.zip(joined._t_entries_missing, 
                                                     joined._t_entries_flatten, 
                                                     fill_missing=False))
    joined = joined.annotate(
        _t_entries_new = joined._t_entries_new.map(
            lambda x: hl.if_else(x[0], hl.missing(missing_struct), x[1])
        )
    )    
    joined = joined.annotate_globals(_t_cols = hl.flatten(joined._t_cols.map(lambda x: x._mt_cols)))
    joined = joined.drop("_t_entries", "_t_entries_missing", "_t_entries_flatten")
    mt = joined._unlocalize_entries("_t_entries_new", "_t_cols", column_keys)
    return mt
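
The core idea of the function above, an outer join on the row key with one entry slot per input table, can be sketched without Hail. The function `union_cols` and the sample data below are hypothetical and only model the logic: a tissue with no row for a given key contributes `None`, so every row ends up with exactly one slot per tissue.

```python
def union_cols(tables, names):
    """Outer-join dict tables on their keys, producing one slot per table.

    tables: list of {row_key: entry} dicts, one per tissue.
    names:  list of column names, in the same order as `tables`.
    Returns {row_key: [entry or None, ...]}.
    """
    all_keys = sorted(set().union(*[t.keys() for t in tables]))
    joined = {key: [t.get(key) for t in tables] for key in all_keys}
    # Every row must have as many entries as there are columns
    for entries in joined.values():
        assert len(entries) == len(names)
    return joined


# Hypothetical per-tissue tables keyed by (locus, gene_id)
liver = {("chr1:100", "GENE_A"): {"slope": 0.5},
         ("chr1:200", "GENE_B"): {"slope": -0.1}}
lung = {("chr1:100", "GENE_A"): {"slope": 0.4}}

joined = union_cols([liver, lung], ["Liver", "Lung"])
for key, entries in joined.items():
    print(key, entries)
```

The placeholder for absent tissues plays the same role as the dummy struct in the Hail function: it keeps the entry array aligned with the column array.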

Now we can read in each individual MatrixTable and add it to the list we will pass to `multi_way_union_cols`.

In [None]:
# Get list of file paths for individual eQTL MatrixTables
list_eqtl_mts = subprocess.run(["gsutil", "-u", "broad-ctsa", "ls", "gs://hail-datasets-tmp/GTEx/eQTL_MatrixTables"], 
                               stdout=subprocess.PIPE)
eqtl_mts = list_eqtl_mts.stdout.decode("utf-8").split()

# Load MatrixTables for each tissue type to store in list for MWUC
mts_list = []
for eqtl_mt in eqtl_mts:
    tissue_name = eqtl_mt.replace("gs://hail-datasets-tmp/GTEx/eQTL_MatrixTables/GTEx_eQTL_all_snp_gene_associations_", "")
    tissue_name = tissue_name.replace("_v8_GRCh38.mt/", "")
    print(tissue_name)
    
    mt = hl.read_matrix_table(eqtl_mt)
    mts_list.append(mt)

full_mt = multi_way_union_cols(mts_list, ["tissue"])
full_mt = full_mt.checkpoint("gs://hail-datasets-tmp/GTEx/checkpoints/GTEx_eQTL_all_snp_gene_associations_cols_unioned.mt", 
                             overwrite=False,
                             _read_if_exists=True)

#### Repartition unioned MatrixTable

After the MWUC, the resulting MatrixTable has very imbalanced partitions (some are KiBs, others are GiBs), so we repartition the unioned MatrixTable.

First we can re-key the rows of our MatrixTable:

In [None]:
# Re-key rows and repartition
full_mt = hl.read_matrix_table("gs://hail-datasets-tmp/GTEx/checkpoints/GTEx_eQTL_all_snp_gene_associations_cols_unioned.mt", 
                               _n_partitions=1000)
full_mt = full_mt.key_rows_by("locus", "alleles")
full_mt = full_mt.checkpoint("gs://hail-datasets-tmp/GTEx/GTEx_eQTL_all_snp_gene_associations.mt", 
                             overwrite=False, 
                             _read_if_exists=True)
full_mt.describe()

I tried reading in the MatrixTable with `_n_partitions=1000` to see how the partitions would look, but a few were still much larger than the rest. So I ended up using `repartition` with a full shuffle, which balanced things out.
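
As a rough illustration of what balancing means here, the plain-Python sketch below redistributes rows from skewed partitions into evenly sized ones. The helper `rebalance` and the row counts are hypothetical; this is not how Hail implements `repartition`, only a picture of the before and after.

```python
def rebalance(partitions, n_partitions):
    """Redistribute all rows into n_partitions chunks of near-equal size."""
    rows = [row for part in partitions for row in part]
    size, extra = divmod(len(rows), n_partitions)
    balanced, start = [], 0
    for i in range(n_partitions):
        # The first `extra` partitions take one additional row each
        end = start + size + (1 if i < extra else 0)
        balanced.append(rows[start:end])
        start = end
    return balanced


# Hypothetical skewed layout: one partition holds almost all of the rows
skewed = [list(range(0, 2)), list(range(2, 95)), list(range(95, 100))]
balanced = rebalance(skewed, 4)

skewed_sizes = [len(p) for p in skewed]
balanced_sizes = [len(p) for p in balanced]
print(skewed_sizes, balanced_sizes)
```

Row order and content are preserved; only the partition boundaries move.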

In [None]:
# Add metadata to globals and write final MatrixTable to hail-datasets-us
full_mt = hl.read_matrix_table("gs://hail-datasets-tmp/GTEx/GTEx_eQTL_all_snp_gene_associations.mt")
full_mt = full_mt.repartition(1000, shuffle=True)

n_rows, n_cols = full_mt.count()
n_partitions = full_mt.n_partitions()

full_mt = full_mt.rename({"_gene_id": "gene_id", "_tss_distance": "tss_distance"})
full_mt = full_mt.annotate_globals(
    metadata = hl.struct(name = "GTEx_eQTL_all_snp_gene_associations",
                         reference_genome = "GRCh38",
                         n_rows = n_rows,
                         n_cols = n_cols,
                         n_partitions = n_partitions)
)
# Final eQTL MatrixTable is ~224 GiB w/ 1000 partitions
full_mt.write("gs://hail-datasets-us/GTEx_eQTL_all_snp_gene_associations_v8_GRCh38.mt")

And now we have a single MatrixTable for the GTEx eQTL data.

In [None]:
hl.read_matrix_table("gs://hail-datasets-us/GTEx_eQTL_all_snp_gene_associations_v8_GRCh38.mt").describe()