Just some things that are helpful to know about Hail ndarrays, Tables, and MatrixTables

In [2]:
import hail as hl
import numpy as np

# General NDArray stuff

You can make a hail ndarray from a numpy ndarray. This will probably be useful for what Patrick was mentioning about using the same random matrix in both numpy and hail: generate the random matrix in numpy, then use `hl.nd.array` to turn it into a hail ndarray.

In [5]:
hnd = hl.nd.array(np.arange(20).reshape((4, 5)))
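
Here is a minimal sketch of the numpy side of that workflow, assuming a fixed seed is acceptable for the comparison. The seed value and the names `shared` and `gram_np` are illustrative and not from the original notes. The Hail conversion is left as a comment so the snippet runs with numpy alone.

```python
import numpy as np

# Seed the generator so every run (and every collaborator) gets the same matrix.
rng = np.random.default_rng(seed=0)  # the seed value is an arbitrary assumption
shared = rng.standard_normal((4, 5))  # float64 matrix with the same shape as `hnd`

# The numpy side of the comparison can use `shared` directly.
gram_np = shared @ shared.T

# The Hail side would wrap the very same values:
#   shared_hl = hl.nd.array(shared)
#   hl.eval(shared_hl)  # expected to round-trip to an array equal to `shared`
```

Because both sides start from identical values, any difference in the results points at the computation and not at the random draw.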

If you have a single hail ndarray, you can use `hl.eval` to turn it into a numpy array. `eval` is never really used in production hail pipelines with tables and whatnot; it's just for experimenting with small values.

In [6]:
hl.eval(hnd)

Initializing Hail with default parameters...
Running on Apache Spark version 2.4.0
SparkUI available at http://192.168.0.12:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.49-8cdca2917be5
LOGGING: writing to /Users/johnc/Code/hail/hail/hail-20200710-1051-0.2.49-8cdca2917be5.log


array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

Be careful: `show` is currently broken for ndarrays (the data prints in the wrong order). Just use `eval` if it's a single ndarray or `collect` when you have a table of ndarrays and you want to look at them.

In [8]:
hnd.show()

A hail array can also be turned into a hail ndarray, which `eval` then returns as a numpy array.

In [15]:
hail_array = hl.array(hl.range(10))
print(hl.eval(hail_array))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


In [17]:
hl.eval(hl.nd.array(hail_array))

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)

Nested arrays work too:

In [18]:
nested_hail_array = hl.range(10).map(lambda x: hl.range(5))

In [21]:
hl.eval(nested_hail_array)

[[0, 1, 2, 3, 4],
 [0, 1, 2, 3, 4],
 [0, 1, 2, 3, 4],
 [0, 1, 2, 3, 4],
 [0, 1, 2, 3, 4],
 [0, 1, 2, 3, 4],
 [0, 1, 2, 3, 4],
 [0, 1, 2, 3, 4],
 [0, 1, 2, 3, 4],
 [0, 1, 2, 3, 4]]

In [22]:
hl.eval(hl.nd.array(nested_hail_array))

array([[0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4]], dtype=int32)

# Now with Tables and MatrixTables

Consider checking the cheat sheets; they're helpful: https://hail.is/docs/0.2/cheatsheets.html


Q: How do I get a matrixtable to experiment with?

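A: I'd probably use `hl.balding_nichols_model` to generate a matrix table, then write it to disk. After that, just read it back in for future computation. The model draws random genotypes, so reading the saved copy keeps later runs deterministic. You only need to do the write part once.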

In [26]:
bnm_mt = hl.balding_nichols_model(3, 100, 1000)

2020-07-10 11:04:44 Hail: INFO: balding_nichols_model: generating genotypes for 3 populations, 100 samples, and 1000 variants...


In [28]:
bnm_mt.write("balding_nichols_3_100_1000.mt")

2020-07-10 11:05:07 Hail: INFO: Coerced sorted dataset
2020-07-10 11:05:08 Hail: INFO: wrote matrix table with 1000 rows and 100 columns in 8 partitions to balding_nichols_3_100_1000.mt


In [39]:
mt = hl.read_matrix_table("balding_nichols_3_100_1000.mt")

Q: How do I check how my data is structured?

A: Use `describe`

In [40]:
mt.describe()

----------------------------------------
Global fields:
    'bn': struct {
        n_populations: int32, 
        n_samples: int32, 
        n_variants: int32, 
        n_partitions: int32, 
        pop_dist: array<int32>, 
        fst: array<float64>, 
        mixture: bool
    }
----------------------------------------
Column fields:
    'sample_idx': int32
    'pop': int32
----------------------------------------
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'ancestral_af': float64
    'af': array<float64>
----------------------------------------
Entry fields:
    'GT': call
----------------------------------------
Column key: ['sample_idx']
Row key: ['locus', 'alleles']
----------------------------------------


Note that the single entry field is `GT`, which is a call. You're going to want a number instead. Let's replace `GT` with a new field `n_alt`, the number of alternate alleles as a `float64`.

In [41]:
mt = mt.transmute_entries(n_alt = hl.float64(mt.GT.n_alt_alleles()))
mt.describe()

----------------------------------------
Global fields:
    'bn': struct {
        n_populations: int32, 
        n_samples: int32, 
        n_variants: int32, 
        n_partitions: int32, 
        pop_dist: array<int32>, 
        fst: array<float64>, 
        mixture: bool
    }
----------------------------------------
Column fields:
    'sample_idx': int32
    'pop': int32
----------------------------------------
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'ancestral_af': float64
    'af': array<float64>
----------------------------------------
Entry fields:
    'n_alt': float64
----------------------------------------
Column key: ['sample_idx']
Row key: ['locus', 'alleles']
----------------------------------------
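
For intuition about what `n_alt_alleles()` computes, here is a numpy-only sketch. It assumes diploid calls stored as pairs of allele indices, where 0 is the reference allele. The array `calls` is made-up data, not output from the matrix table above.

```python
import numpy as np

# Each call is a pair of allele indices; 0 means the reference allele.
# Rows: 0/0 (hom-ref), 0/1 (het), 1/1 (hom-alt).
calls = np.array([[0, 0], [0, 1], [1, 1]])

# Count the non-reference alleles in each call, then cast to float64
# to mirror hl.float64(mt.GT.n_alt_alleles()).
n_alt = np.count_nonzero(calls, axis=1).astype(np.float64)
print(n_alt)  # [0. 1. 2.]
```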


Q: How do I turn a matrix table into a table?

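A: There's more than one way, but I think you want `localize_entries`. The first argument names the new row field that holds each row's entries, and the second names the new global field that holds the column data.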

In [42]:
ht = mt.localize_entries("ent", "sample")

In [43]:
ht.describe()

----------------------------------------
Global fields:
    'bn': struct {
        n_populations: int32, 
        n_samples: int32, 
        n_variants: int32, 
        n_partitions: int32, 
        pop_dist: array<int32>, 
        fst: array<float64>, 
        mixture: bool
    } 
    'sample': array<struct {
        sample_idx: int32, 
        pop: int32
    }> 
----------------------------------------
Row fields:
    'locus': locus<GRCh37> 
    'alleles': array<str> 
    'ancestral_af': float64 
    'af': array<float64> 
    'ent': array<struct {
        n_alt: float64
    }> 
----------------------------------------
Key: ['locus', 'alleles']
----------------------------------------


Notice how previously we had an entry field called `n_alt` that was a `float64`, but now we have a row field `ent` that is an `array<struct{n_alt: float64}>`, with one element per sample. The struct is unnecessary, though, so let's remove it:

In [44]:
ht = ht.transmute(ent = ht.ent.map(lambda x: x.n_alt))

In [45]:
ht.describe()

----------------------------------------
Global fields:
    'bn': struct {
        n_populations: int32, 
        n_samples: int32, 
        n_variants: int32, 
        n_partitions: int32, 
        pop_dist: array<int32>, 
        fst: array<float64>, 
        mixture: bool
    } 
    'sample': array<struct {
        sample_idx: int32, 
        pop: int32
    }> 
----------------------------------------
Row fields:
    'locus': locus<GRCh37> 
    'alleles': array<str> 
    'ancestral_af': float64 
    'af': array<float64> 
    'ent': array<float64> 
----------------------------------------
Key: ['locus', 'alleles']
----------------------------------------
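
The same reshaping can be sketched outside Hail to make the row layout concrete. This is only an analogy: Hail structs are modeled as Python dicts, and the 2-variant by 3-sample matrix `entries` is made-up data.

```python
import numpy as np

# A tiny 2-variant x 3-sample entry matrix of n_alt values (made-up data).
entries = np.array([[0.0, 2.0, 1.0],
                    [1.0, 1.0, 0.0]])

# After localizing, each table row carries its own array of entry structs,
# one struct per sample. Structs are modeled here as dicts.
rows = [{"ent": [{"n_alt": float(v)} for v in variant]} for variant in entries]

# Equivalent of ht.ent.map(lambda x: x.n_alt): unwrap each struct to a bare float.
rows = [{"ent": [e["n_alt"] for e in row["ent"]]} for row in rows]
print(rows[0]["ent"])  # [0.0, 2.0, 1.0]
```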


I think you're going to need an undocumented function called `_group_within_partitions`. It takes a table and groups adjacent rows within each partition into a single row that holds an array of the original rows.

In [47]:
ght = ht._group_within_partitions("group_field", 10)

In [48]:
ght.describe()

----------------------------------------
Global fields:
    'bn': struct {
        n_populations: int32, 
        n_samples: int32, 
        n_variants: int32, 
        n_partitions: int32, 
        pop_dist: array<int32>, 
        fst: array<float64>, 
        mixture: bool
    } 
    'sample': array<struct {
        sample_idx: int32, 
        pop: int32
    }> 
----------------------------------------
Row fields:
    'locus': locus<GRCh37> 
    'alleles': array<str> 
    'group_field': array<struct {
        locus: locus<GRCh37>, 
        alleles: array<str>, 
        ancestral_af: float64, 
        af: array<float64>, 
        ent: array<float64>
    }> 
----------------------------------------
Key: ['locus', 'alleles']
----------------------------------------


Now there's a field called `group_field` that contains an array of structs, where the `struct`'s fields are the old row fields. The arrays have a maximum length of 10, as specified by the second argument to `_group_within_partitions`. Mapping each struct to its `ent` field gives an array of arrays, which you can turn into an ndarray:

In [51]:
ght = ght.annotate(ent_nd = hl.nd.array(ght.group_field.map(lambda x: x.ent)))

In [52]:
ght.describe()

----------------------------------------
Global fields:
    'bn': struct {
        n_populations: int32, 
        n_samples: int32, 
        n_variants: int32, 
        n_partitions: int32, 
        pop_dist: array<int32>, 
        fst: array<float64>, 
        mixture: bool
    } 
    'sample': array<struct {
        sample_idx: int32, 
        pop: int32
    }> 
----------------------------------------
Row fields:
    'locus': locus<GRCh37> 
    'alleles': array<str> 
    'group_field': array<struct {
        locus: locus<GRCh37>, 
        alleles: array<str>, 
        ancestral_af: float64, 
        af: array<float64>, 
        ent: array<float64>
    }> 
    'ent_nd': ndarray<float64, 2> 
----------------------------------------
Key: ['locus', 'alleles']
----------------------------------------
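
To see what shapes to expect in `ent_nd`, here is a numpy-only sketch of the grouping. It assumes the 1000 rows are spread evenly over the 8 partitions reported when the matrix table was written (125 rows each) and that groups never cross a partition boundary, as the function name suggests. The real partition sizes may differ, and the random values stand in for the actual `n_alt` data.

```python
import numpy as np

n_variants, n_samples, n_partitions, group_size = 1000, 100, 8, 10

# Stand-in for the localized table: one row of n_alt values per variant.
rng = np.random.default_rng(0)
ent = rng.integers(0, 3, size=(n_variants, n_samples)).astype(np.float64)

# Assumption: rows are split evenly across partitions (125 rows each here).
partitions = np.array_split(ent, n_partitions)

# Within each partition, take consecutive chunks of at most `group_size` rows.
# Each chunk is already a 2-D block, like one value of `ent_nd`.
blocks = [
    part[start:start + group_size]
    for part in partitions
    for start in range(0, len(part), group_size)
]

print(len(blocks))        # 104 groups: 13 per partition
print(blocks[0].shape)    # (10, 100)
print(blocks[12].shape)   # (5, 100), the short last group of the first partition
```

Under these assumptions most blocks are 10 by 100, but the last group in each partition can be shorter, so downstream code should not assume every `ent_nd` has exactly 10 rows.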
