# Institute for Behavioral Genetics International Statistical Genetics 2021 Workshop 

## Advanced Hail Functionality

This notebook is a grab bag of more advanced Hail functionality.

### Approximate CDF

Computing exact quantiles or the median normally requires sorting the entire dataset. Hail instead uses a sophisticated data structure that yields provably good approximations of all quantiles without sorting the data, specifying buckets in advance, or using unbounded memory.

In [1]:
import os
# Give Hail plenty of RAM; this must be set before Hail (and Spark) is initialized
os.environ['PYSPARK_SUBMIT_ARGS'] = '--executor-memory 16G --driver-memory 16G pyspark-shell'

In [2]:
import hail as hl
hl.plot.output_notebook()

In [3]:
mt = hl.read_matrix_table('resources/hgdp-subset-3.mt')
mt = hl.variant_qc(mt)
cdf = mt.aggregate_rows(hl.agg.approx_cdf(mt.variant_qc.call_rate))
cdf

Initializing Hail with default parameters...


23/02/22 10:00:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/02/22 10:00:16 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
23/02/22 10:00:16 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
23/02/22 10:00:16 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.


Running on Apache Spark version 3.3.0
SparkUI available at http://wm28c-761:4043
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.109-537f8f740a91
LOGGING: writing to /Users/dking/projects/2021_IBG_Hail/hail-20230222-1000-0.2.109-537f8f740a91.log

Struct(values=[0.0002409058058299205, 0.004095398699108649, 0.023367863165502288, 0.05299927728258251, 0.06600819079739821, 0.08359431462298242, 0.17778848470248132, 0.20284268850879306, 0.28956877860756447, 0.3078776198506384, 0.321127439171284, 0.3218501565887738, 0.3734039990363768, 0.3736449048422067, 0.45699831365935917, 0.4808479884365213, 0.5201156347867983, 0.5292700554083354, 0.5456516502047699, 0.5658877378944832, 0.5808238978559384, 0.5899783184774753, 0.6530956396049145, 0.6791134666345459, 0.6793543724403758, 0.7268128161888702, 0.7472898096844134, 0.7487352445193929, 0.7783666586364731, 0.781498434112262, 0.8026981450252951, 0.812093471452662, 0.8214887978800289, 0.827511443025777, 0.8279932546374368, 0.8383522042881233, 0.8446157552397012, 0.8614791616477957, 0.8706335822693327, 0.8759335099975909, 0.8850879306191279, 0.891110575764876, 0.9009877138039026, 0.9050831125030113, 0.9166465911828475, 0.921223801493616, 0.9255601059985545, 0.9260419176102144, 0.932305468561792

In [4]:
import bokeh.plotting as bp

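def plot_cdf(cdf, title):
    # `values` is a sorted sample of the data; `ranks` is one element longer
    # and ends with the total number of data points.
    values = cdf['values']
    # Repeat the last value so `values` and `ranks` have the same length
    values = values + [values[-1]]
    ranks = cdf['ranks']
    # Convert cumulative counts to cumulative fractions in [0, 1]
    ranks = [x / ranks[-1] for x in ranks]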

    p = bp.figure(title=title, plot_width=400, plot_height=400)
    p.step(x=[0] + values, y=[0] + ranks, line_width=2, line_color='black')

    hl.plot.show(p)
    
plot_cdf(cdf, 'Approximate CDF of Call Rate')
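
The `values` and `ranks` arrays that `plot_cdf` draws can also be queried directly for any quantile. The following is a minimal pure-Python sketch on made-up numbers, not Hail's own implementation. It assumes that `ranks[i + 1]` approximates the number of data points less than or equal to `values[i]`; the helper name `quantile_from_cdf` and the toy arrays are hypothetical.

```python
from bisect import bisect_left

def quantile_from_cdf(values, ranks, q):
    """Return the sampled value at cumulative fraction q (0 <= q <= 1).

    Assumption: `values` is sorted, `ranks` has one more element than
    `values`, ranks[i + 1] approximates the count of data points that are
    <= values[i], and ranks[-1] is the total count.
    """
    target = q * ranks[-1]
    # Smallest i such that ranks[i + 1] >= target
    i = bisect_left(ranks, target, lo=1) - 1
    # Clamp so that q = 0 and q = 1 stay inside the sampled values
    return values[min(max(i, 0), len(values) - 1)]

# Hypothetical toy CDF: 10 data points summarized by 4 sampled values
toy_values = [0.1, 0.4, 0.7, 0.9]
toy_ranks = [0, 2, 5, 9, 10]

median = quantile_from_cdf(toy_values, toy_ranks, 0.5)
p90 = quantile_from_cdf(toy_values, toy_ranks, 0.9)
print(median, p90)
```

Because the summary stores only a sample of the data, the answer is always one of the sampled values, which is why the result is approximate.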

In [6]:
mt = hl.read_matrix_table('resources/hgdp-subset-3.mt')
mt = hl.variant_qc(mt)
cdf = mt.aggregate_rows(hl.agg.approx_cdf(mt.variant_qc.AF[0]))
plot_cdf(cdf, 'Approximate CDF of Reference AF')



You can also ask directly for the median:

In [7]:
mt.aggregate_rows(hl.agg.approx_median(mt.variant_qc.AF[0]))



0.9997590361445783

### PCA on Unusual Values

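Hail's methods are flexible and general-purpose, which lets analysts explore data sets with novel statistics. Here, PCA is run on standardized genotype missingness rather than on the genotypes themselves.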

In [11]:
mt = hl.read_matrix_table('resources/hgdp-subset-3.mt')
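# Keep only variants with at least one missing genotype call
mt = mt.filter_rows(hl.agg.any(hl.is_missing(mt.GT)))
# Indicator: is this genotype call missing?
mt = mt.annotate_entries(
    is_missing = hl.is_missing(mt.GT)
)
# Per-variant mean and standard deviation of the indicator
mt = mt.annotate_rows(
    is_missing_stats = hl.agg.stats(mt.is_missing)
)
# Standardize the indicator within each variant
mt = mt.annotate_entries(
    normed_is_missing = (mt.is_missing - mt.is_missing_stats.mean) / mt.is_missing_stats.stdev
)
# Run PCA on the standardized missingness instead of on genotypes
_, scores, _ = hl.pca(mt.normed_is_missing, k=2)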
hl.plot.show(hl.plot.scatter(scores.scores[0], scores.scores[1]))



FatalError: HailException: Cannot create RowMatrix: filtered entry at row 0 and col 2

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 6.0 failed 1 times, most recent failure: Lost task 3.0 in stage 6.0 (TID 27) (wm28c-761 executor driver): is.hail.utils.HailException: Cannot create RowMatrix: filtered entry at row 0 and col 2
	at is.hail.utils.ErrorHandling.fatal(ErrorHandling.scala:17)
	at is.hail.utils.ErrorHandling.fatal$(ErrorHandling.scala:17)
	at is.hail.utils.package$.fatal(package.scala:78)
	at is.hail.expr.ir.MatrixValue.$anonfun$toRowMatrix$3(MatrixValue.scala:306)
	at is.hail.expr.ir.MatrixValue.$anonfun$toRowMatrix$3$adapted(MatrixValue.scala:293)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:496)
	at is.hail.utils.richUtils.RichContextRDD$$anon$1.next(RichContextRDD.scala:77)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:496)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:224)
	at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:302)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1508)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1435)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1499)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1322)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:327)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2323)
	at org.apache.spark.rdd.RDD.$anonfun$fold$1(RDD.scala:1174)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
	at org.apache.spark.rdd.RDD.fold(RDD.scala:1168)
	at org.apache.spark.rdd.RDD.$anonfun$treeAggregate$2(RDD.scala:1267)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
	at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1228)
	at org.apache.spark.rdd.RDD.$anonfun$treeAggregate$1(RDD.scala:1214)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
	at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1214)
	at org.apache.spark.mllib.linalg.distributed.RowMatrix.multiplyGramianMatrixBy(RowMatrix.scala:94)
	at org.apache.spark.mllib.linalg.distributed.RowMatrix.$anonfun$computeSVD$8(RowMatrix.scala:385)
	at org.apache.spark.mllib.linalg.EigenValueDecomposition$.symmetricEigs(EigenValueDecomposition.scala:103)
	at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:385)
	at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:311)
	at org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix.computeSVD(IndexedRowMatrix.scala:231)
	at is.hail.methods.PCA.execute(PCA.scala:41)
	at is.hail.expr.ir.functions.WrappedMatrixToTableFunction.execute(RelationalFunctions.scala:52)
	at is.hail.expr.ir.TableToTableApply.execute(TableIR.scala:3489)
	at is.hail.expr.ir.TableIR.analyzeAndExecute(TableIR.scala:62)
	at is.hail.expr.ir.Interpret$.run(Interpret.scala:865)
	at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:59)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:20)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:67)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:72)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:69)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:14)
	at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:13)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:64)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:22)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:20)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:20)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:47)
	at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:453)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$2(SparkBackend.scala:489)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:74)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:74)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:62)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:341)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$1(SparkBackend.scala:486)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.backend.spark.SparkBackend.executeEncode(SparkBackend.scala:485)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.lang.Thread.run(Thread.java:750)

Hail version: 0.2.109-537f8f740a91
Error summary: HailException: Cannot create RowMatrix: filtered entry at row 0 and col 2
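
The per-variant standardization that feeds PCA in the cell above can be pictured on a tiny matrix. This is a pure-Python sketch on hypothetical data, not Hail code. It uses the population standard deviation on the assumption that this matches the `stdev` reported by `hl.agg.stats`, and the guard for a zero standard deviation is an addition that the original cell does not have.

```python
from statistics import mean, pstdev

# Hypothetical toy genotype matrix: rows are variants, columns are samples.
# None marks a missing call; numbers are alternate-allele counts.
genotypes = [
    [0, None, 1, 2],
    [None, None, 0, 1],
    [1, 2, None, 0],
]

def standardize_missingness(row):
    """Map one variant's calls to z-scored missingness indicators."""
    indicator = [1.0 if g is None else 0.0 for g in row]
    mu = mean(indicator)
    sigma = pstdev(indicator)  # population standard deviation (assumption)
    if sigma == 0:
        # All calls missing or all present: no variation to standardize
        return [0.0] * len(row)
    return [(x - mu) / sigma for x in indicator]

normed = [standardize_missingness(row) for row in genotypes]
for row in normed:
    print([round(x, 3) for x in row])
```

The `FatalError` above concerns filtered entries in the matrix table, not this arithmetic. One thing that may be worth trying, untested here, is calling `mt.unfilter_entries()` before the annotations so that filtered entries become ordinary missing entries.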


### LD Prune

In [None]:
?hl.ld_prune

In [None]:
mt = hl.read_matrix_table('resources/qced-hgdp-1kg.mt')
print(f'Before pruning we have: {mt.count_rows()}')
pruned_variants = hl.ld_prune(mt.GT)
pruned_variants.write('output/pruned_variants.ht', overwrite=True)
pruned_variants = hl.read_table('output/pruned_variants.ht')

mt = mt.filter_rows(hl.is_defined(pruned_variants[mt.row_key]))
print(f'After pruning we have: {mt.count_rows()}')
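
To give a feel for what pruning on linkage disequilibrium means, here is a deliberately simplified pure-Python sketch. It is not the algorithm `hl.ld_prune` uses: it ignores the base-pair window and allele-frequency tie-breaking, and it simply drops a variant when its squared correlation with an already kept variant exceeds a threshold. The function names and toy dosages are hypothetical; the threshold of 0.2 mirrors the documented default `r2=0.2` of `hl.ld_prune`.

```python
from statistics import mean

def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # treat a monomorphic variant as uncorrelated
    return cov * cov / (var_x * var_y)

def greedy_prune(variants, r2_threshold=0.2):
    """Keep a variant unless it is in high LD with one already kept."""
    kept = []
    for name, dosages in variants:
        if all(r_squared(dosages, d) <= r2_threshold for _, d in kept):
            kept.append((name, dosages))
    return [name for name, _ in kept]

# Hypothetical toy data: alternate-allele counts for six samples
toy_variants = [
    ('v1', [0, 1, 2, 0, 1, 2]),
    ('v2', [0, 1, 2, 0, 1, 2]),  # perfectly correlated with v1
    ('v3', [1, 0, 1, 1, 2, 1]),  # uncorrelated with v1
]

print(greedy_prune(toy_variants))
```

In this toy, `v2` is removed because it duplicates the signal in `v1`, while `v3` survives.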

### Kinship Estimators

Hail supports a number of different kinship estimators.

Getting PC Relate to produce good-looking results is tricky! Here we see what happens when you don't quality control the variants well enough.

In [12]:
mt = hl.read_matrix_table('resources/qced-hgdp-1kg.mt')

# The second argument is the minimum individual-specific minor allele frequency
pc_kin = hl.pc_relate(mt.GT, min_individual_maf=0.1, k=4, statistics='kin20', min_kinship=0.1)
pc_kin.write('output/pc_kin.ht', overwrite=True)
pc_kin = hl.read_table('output/pc_kin.ht')

hl.plot.show(
    hl.plot.scatter(
        pc_kin.kin,
        pc_kin.ibd0,
        width=400,
        height=400,
        size=3
    )
)

FatalError: HailException: No file or directory found at resources/qced-hgdp-1kg.mt

Java stack trace:
is.hail.utils.HailException: No file or directory found at resources/qced-hgdp-1kg.mt
	at is.hail.utils.ErrorHandling.fatal(ErrorHandling.scala:17)
	at is.hail.utils.ErrorHandling.fatal$(ErrorHandling.scala:17)
	at is.hail.utils.package$.fatal(package.scala:78)
	at is.hail.expr.ir.RelationalSpec$.readMetadata(AbstractMatrixTableSpec.scala:33)
	at is.hail.expr.ir.RelationalSpec$.readReferences(AbstractMatrixTableSpec.scala:74)
	at is.hail.variant.ReferenceGenome$.fromHailDataset(ReferenceGenome.scala:584)
	at is.hail.variant.ReferenceGenome.fromHailDataset(ReferenceGenome.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.lang.Thread.run(Thread.java:750)



Hail version: 0.2.109-537f8f740a91
Error summary: HailException: No file or directory found at resources/qced-hgdp-1kg.mt

In [None]:
mt = hl.read_matrix_table('resources/qced-hgdp-1kg.mt')

king_kin = hl.king(mt.GT)
king_kin = king_kin.filter_entries(king_kin.phi > 0.1).entries()
king_kin.write('output/king_kin.ht', overwrite=True)
king_kin = hl.read_table('output/king_kin.ht')

hl.plot.show(
    hl.plot.histogram(
        king_kin.phi
    )
)

In [13]:
king_kin.filter(king_kin.phi < 0.45).show()

NameError: name 'king_kin' is not defined
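
For intuition about the `phi` values filtered above, the sketch below computes a KING-robust style kinship estimate for one pair of samples in pure Python. It assumes the between-family form of the estimator, one half minus the sum of squared genotype differences divided by four times the smaller heterozygote count, and it is not Hail's implementation. The function name and the toy genotypes are hypothetical.

```python
def king_robust_phi(x, y):
    """KING-robust style kinship estimate for two samples.

    x and y hold alternate-allele counts (0, 1, 2), or None when missing.
    Only variants called in both samples are used.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    squared_diff = sum((a - b) ** 2 for a, b in pairs)
    het_x = sum(1 for a, _ in pairs if a == 1)
    het_y = sum(1 for _, b in pairs if b == 1)
    denom = min(het_x, het_y)
    if denom == 0:
        return None  # undefined when a sample has no heterozygous calls
    return 0.5 - squared_diff / (4 * denom)

# Hypothetical toy genotypes over eight variants
sample_a = [0, 1, 2, 1, 1, 0, 2, 1]
sample_b = [2, 1, 0, 1, 0, 2, 0, 1]

print(king_robust_phi(sample_a, sample_a))  # a sample compared with itself
print(king_robust_phi(sample_a, sample_b))  # a very dissimilar pair
```

A sample compared with itself, or with a duplicate, gives 0.5, which is why a cutoff such as `phi < 0.45` separates duplicated samples from genuinely related pairs. Strongly dissimilar pairs can come out negative.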

Hail also supports identity-by-descent calculation (`hl.identity_by_descent`), but at the time of writing it does not work on Apple M1 chips because it relies on fast native code that has not yet been compiled for M1. Expect a fix soon!

### Polygenic Score Calculation

In this section, I import a height polygenic score from the [PGS Catalog](https://www.pgscatalog.org/score/PGS000297/) and use it to calculate polygenic scores for the samples in our toy dataset. The toy dataset shares too few variants with the score to produce useful estimates, but the code below could be applied effectively to a larger, quality-controlled dataset.

In [14]:
ht = hl.import_table('resources/height-polygenic-score.txt', comment='#', impute=True)
ht = ht.key_by(
    locus = hl.locus(hl.str(ht.chr_name), ht.chr_position)
)
ht.write('output/height-polygenic-score.ht', overwrite=True)

FatalError: HailException: arguments refer to no files: Vector(resources/height-polygenic-score.txt).

Java stack trace:
is.hail.utils.HailException: arguments refer to no files: Vector(resources/height-polygenic-score.txt).
	at is.hail.utils.ErrorHandling.fatal(ErrorHandling.scala:17)
	at is.hail.utils.ErrorHandling.fatal$(ErrorHandling.scala:17)
	at is.hail.utils.package$.fatal(package.scala:78)
	at is.hail.expr.ir.StringTableReader$.getFileStatuses(StringTableReader.scala:41)
	at is.hail.expr.ir.StringTableReader$.apply(StringTableReader.scala:29)
	at is.hail.expr.ir.StringTableReader$.fromJValue(StringTableReader.scala:35)
	at is.hail.expr.ir.TableReader$.fromJValue(TableIR.scala:116)
	at is.hail.expr.ir.IRParser$.table_ir_1(Parser.scala:1596)
	at is.hail.expr.ir.IRParser$.$anonfun$table_ir$1(Parser.scala:1567)
	at is.hail.utils.StackSafe$More.advance(StackSafe.scala:64)
	at is.hail.utils.StackSafe$.run(StackSafe.scala:16)
	at is.hail.utils.StackSafe$StackFrame.run(StackSafe.scala:32)
	at is.hail.expr.ir.IRParser$.$anonfun$parse_value_ir$1(Parser.scala:2115)
	at is.hail.expr.ir.IRParser$.parse(Parser.scala:2111)
	at is.hail.expr.ir.IRParser$.parse_value_ir(Parser.scala:2115)
	at is.hail.backend.spark.SparkBackend.$anonfun$parse_value_ir$2(SparkBackend.scala:681)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:74)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:74)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:62)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:341)
	at is.hail.backend.spark.SparkBackend.$anonfun$parse_value_ir$1(SparkBackend.scala:680)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:59)
	at is.hail.backend.spark.SparkBackend.parse_value_ir(SparkBackend.scala:679)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.lang.Thread.run(Thread.java:750)



Hail version: 0.2.109-537f8f740a91
Error summary: HailException: arguments refer to no files: Vector(resources/height-polygenic-score.txt).

In [None]:
ht = hl.read_table('output/height-polygenic-score.ht')

In [None]:
ht.show()

In [None]:
mt = hl.read_matrix_table('resources/1kg.mt')
mt = hl.variant_qc(mt)
mt = mt.annotate_rows(score=ht[mt.locus])

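# The score is flipped when its effect allele is our reference allele;
# it is missing when the effect allele matches neither allele.
mt = mt.annotate_rows(is_flipped = (
    hl.case()
    .when(mt.score.effect_allele == mt.alleles[0], True)
    .when(mt.score.effect_allele == mt.alleles[1], False)
    .or_missing()
))
# Expected effect-allele count (2 x effect-allele frequency), used to impute missing calls
mt = mt.annotate_rows(
    mean_gt=2 * hl.if_else(mt.is_flipped, mt.variant_qc.AF[0], mt.variant_qc.AF[1])
)
# Observed number of effect alleles carried by each sample
mt = mt.annotate_entries(
    n_effect_alleles = hl.if_else(
        mt.is_flipped,
        2 - mt.GT.n_alt_alleles(),
        mt.GT.n_alt_alleles()
    )
)
# Per-sample score: weighted sum of effect-allele counts, falling back to the mean
mt = mt.annotate_cols(
    prs = hl.agg.sum(mt.score.effect_weight * hl.coalesce(mt.n_effect_alleles, mt.mean_gt)),
    n_useful_variants = hl.agg.sum(hl.is_defined(mt.score.effect_weight))
)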
mt.cols().show()
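
The scoring logic in the cell above can be traced by hand on a single sample. This pure-Python sketch uses hypothetical variants and weights rather than the PGS Catalog file, and the dictionary keys are illustrative names, not the file's exact schema. It mirrors the same three steps: decide whether the effect allele is the reference allele, count effect alleles, and impute missing calls with twice the effect-allele frequency.

```python
def polygenic_score(variants, genotypes):
    """Sum effect_weight * effect-allele dosage over usable variants.

    variants: dicts with ref, alt, alt_af, effect_allele, effect_weight
    genotypes: alternate-allele counts (0, 1, 2) or None, one per variant
    Returns the score and the number of variants that contributed.
    """
    score = 0.0
    n_used = 0
    for v, n_alt in zip(variants, genotypes):
        if v['effect_allele'] == v['ref']:
            flipped = True
        elif v['effect_allele'] == v['alt']:
            flipped = False
        else:
            continue  # effect allele matches neither allele: skip
        effect_af = 1 - v['alt_af'] if flipped else v['alt_af']
        if n_alt is None:
            dosage = 2 * effect_af  # mean imputation for a missing call
        else:
            dosage = 2 - n_alt if flipped else n_alt
        score += v['effect_weight'] * dosage
        n_used += 1
    return score, n_used

# Hypothetical toy score file and one sample's genotypes
toy_variants = [
    {'ref': 'A', 'alt': 'G', 'alt_af': 0.25, 'effect_allele': 'G', 'effect_weight': 0.5},
    {'ref': 'C', 'alt': 'T', 'alt_af': 0.5, 'effect_allele': 'C', 'effect_weight': -0.25},
    {'ref': 'G', 'alt': 'A', 'alt_af': 0.25, 'effect_allele': 'A', 'effect_weight': 1.0},
    {'ref': 'T', 'alt': 'C', 'alt_af': 0.5, 'effect_allele': 'G', 'effect_weight': 2.0},
]
toy_genotypes = [1, 2, None, 0]

score, n_used = polygenic_score(toy_variants, toy_genotypes)
print(score, n_used)
```

The fourth toy variant is skipped because its effect allele matches neither allele, which corresponds to the `or_missing()` branch in the Hail code.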

### LD Score

Hail also has utilities for simulating phenotypes, calculating LD Scores, and running LD Score regression.

In [None]:
mt = hl.read_matrix_table('resources/qced-hgdp-1kg.mt')
mt = hl.experimental.ldscsim.simulate_phenotypes(mt, mt.GT, h2=0.5)
mt.y.show()

In [15]:
betas = hl.linear_regression_rows(y=mt.y, x=mt.GT.n_alt_alleles(), covariates=[1.0])

AttributeError: MatrixTable instance has no field, method, or property 'y'
    Hint: use 'describe()' to show the names of all data fields.

In [16]:
betas.show()

NameError: name 'betas' is not defined

In [17]:
?hl.experimental.ld_score

In [18]:
ht_scores = hl.experimental.ld_score(entry_expr=mt.GT.n_alt_alleles(),
                                     locus_expr=mt.locus,
                                     radius=1e6)


betas = betas.annotate(z_score = betas.beta / betas.standard_error)
betas = betas.annotate(chi_sq_statistic = betas.z_score ** 2)

ht = mt.rows()

ht_results = hl.experimental.ld_score_regression(
    weight_expr=ht_scores[ht.locus].univariate,
    ld_score_expr=ht_scores[ht.locus].univariate,
    chi_sq_exprs=betas[ht.key].chi_sq_statistic,
    n_samples_exprs=betas[ht.key].n
)



FatalError: HailException: zip: length mismatch

Java stack trace:
is.hail.utils.HailException: zip: length mismatch
	at __C7438Compiled.__m7439split_ToArray_region2_40(Emit.scala)
	at __C7438Compiled.__m7439split_ToArray(Emit.scala)
	at __C7438Compiled.apply(Emit.scala)
	at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$_apply$4(CompileAndEvaluate.scala:58)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$_apply$2(CompileAndEvaluate.scala:58)
	at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$_apply$2$adapted(CompileAndEvaluate.scala:56)
	at is.hail.backend.ExecuteContext.$anonfun$scopedExecution$1(ExecuteContext.scala:137)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.backend.ExecuteContext.scopedExecution(ExecuteContext.scala:137)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:56)
	at is.hail.expr.ir.CompileAndEvaluate$.evalToIR(CompileAndEvaluate.scala:30)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:30)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:67)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:72)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:69)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:14)
	at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:13)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:64)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:22)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:20)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:20)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:47)
	at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:453)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$2(SparkBackend.scala:489)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:74)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:74)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:62)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:341)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$1(SparkBackend.scala:486)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.backend.spark.SparkBackend.executeEncode(SparkBackend.scala:485)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.lang.Thread.run(Thread.java:750)



Hail version: 0.2.109-537f8f740a91
Error summary: HailException: zip: length mismatch

In [None]:
ht_results.write('output/ldsr.ht', overwrite=True)
ldsr = hl.read_table('output/ldsr.ht')
ldsr.show()
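
As a rough picture of what an LD score is, the sketch below sums the squared correlation between each variant and every variant within a fixed distance, itself included. It is a pure-Python toy on hypothetical data and is not how `hl.experimental.ld_score` is implemented; in particular it uses the plain squared correlation with no sample-size adjustment, and the function names are made up for illustration.

```python
from statistics import mean

def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # treat a monomorphic variant as uncorrelated
    return cov * cov / (var_x * var_y)

def ld_scores(positions, dosages, radius):
    """LD score of variant j: sum of r^2 with variants within `radius`."""
    scores = []
    for j, pos_j in enumerate(positions):
        total = 0.0
        for k, pos_k in enumerate(positions):
            if abs(pos_j - pos_k) <= radius:
                total += r_squared(dosages[j], dosages[k])
        scores.append(total)
    return scores

# Hypothetical toy data: three variants on one chromosome, six samples
toy_positions = [100, 200, 5000]
toy_dosages = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],  # in perfect LD with the first variant
    [1, 0, 1, 1, 2, 1],  # far away from the other two
]

print(ld_scores(toy_positions, toy_dosages, radius=500))
```

The two nearby, perfectly correlated variants each get a score of 2, while the distant variant only counts itself.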

### Annotation Database

The Hail team maintains a database of common variant annotations in Google Cloud Storage and S3. These commands will only work when executed inside a cluster with access to Google Cloud Storage or S3.

A full list of available annotations can be found [in the Hail docs](https://hail.is/docs/0.2/annotation_database_ui.html).

In [None]:
mt = hl.read_matrix_table('resources/1kg.mt')

db = hl.experimental.DB(region='us', cloud='aws')
mt = db.annotate_rows_db(
    mt,
    'CADD', 'GTEx_eQTL_Adipose_Subcutaneous_all_snp_gene_associations', 'gnomad_ld_scores_afr'
)
mt.rows().show()

### VEP

Hail also supports VEP annotation. This requires a specially configured cluster.

In [None]:
mt = hl.read_matrix_table('resources/1kg.mt')
mt = hl.vep(mt)
mt.vep.show()

In [None]:
mt = hl.read_matrix_table('resources/qced-hgdp-1kg.mt')
mt.vep.show()

In [None]:
mt = mt.annotate_rows(
    interesting_cnsq = mt.vep.transcript_consequences.find(lambda x: x.consequence_terms.contains("stop_gained"))
)
mt = mt.filter_rows(hl.is_defined(mt.interesting_cnsq))
mt.interesting_cnsq.show(n=30)
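
The `find` call in the last cell returns the first transcript consequence that satisfies the predicate and a missing value when none does, which is why `hl.is_defined` can then be used as a filter. A plain-Python analogue on hypothetical, heavily simplified records looks like this; the helper `find_first` is illustrative and `None` stands in for Hail's missing value.

```python
def find_first(items, predicate):
    """Return the first item satisfying predicate, or None if there is none."""
    return next((item for item in items if predicate(item)), None)

# Hypothetical, heavily simplified VEP-style transcript consequences
transcript_consequences = [
    {'gene_symbol': 'GENE1', 'consequence_terms': ['missense_variant']},
    {'gene_symbol': 'GENE1', 'consequence_terms': ['stop_gained', 'splice_region_variant']},
    {'gene_symbol': 'GENE2', 'consequence_terms': ['stop_gained']},
]

hit = find_first(
    transcript_consequences,
    lambda x: 'stop_gained' in x['consequence_terms'],
)
miss = find_first(
    transcript_consequences,
    lambda x: 'frameshift_variant' in x['consequence_terms'],
)
print(hit, miss)
```

Only the first match is returned, so a variant with several stop-gained transcripts still yields a single record.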