
import_vcf -> group_by ignores filter #11562

chrisvittal opened this issue Mar 11, 2022 · 1 comment · Fixed by #11563

In [1]: import hail as hl

In [2]: hl.init()
2022-03-11 14:49:23 WARN  Utils:69 - Your hostname, metis resolves to a loopback address:; using instead (on interface eth0)
2022-03-11 14:49:23 WARN  Utils:69 - Set SPARK_LOCAL_IP if you need to bind to another address
2022-03-11 14:49:23 WARN  NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Running on Apache Spark version 3.1.2
SparkUI available at
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.90-92e40ce648a8
LOGGING: writing to /home/cdv/src/hail/hail/hail-20220311-1449-0.2.90-92e40ce648a8.log

In [3]: mt = hl.import_vcf('src/test/resources/sample.vcf').filter_rows(False)

In [4]: ht = mt._localize_entries('entries', 'columns')

In [5]: groups = ht.group_by(the_key=ht.key).aggregate(value=hl.agg.collect(ht.row_value)).collect()
2022-03-11 14:50:08 Hail: INFO: Coerced sorted dataset
2022-03-11 14:50:10 Hail: INFO: Ordering unsorted dataset with network shuffle

In [6]: len(groups)
Out[6]: 346

In [7]: mt = mt.checkpoint('~/tmp/hail/')
2022-03-11 14:51:14 Hail: INFO: wrote matrix table with 0 rows and 100 columns in 0 partitions to ~/tmp/hail/

In [8]: ht = mt._localize_entries('entries', 'columns')

In [9]: groups_native = ht.group_by(the_key=ht.key).aggregate(value=hl.agg.collect(ht.row_value)).collect()

In [10]: len(groups_native)
Out[10]: 0
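The expected semantics here are simple: `filter_rows(False)` should drop every row, so grouping the filtered table must yield zero groups (as the checkpointed version correctly reports). That invariant can be sketched in plain Python, independent of Hail (the row tuples below are illustrative, not real VCF data):

```python
from itertools import groupby

# Illustrative rows keyed by (contig, position)-like tuples; not Hail API.
rows = [("20", 10019093), ("20", 10026348), ("20", 10026357)]

# The analogue of filter_rows(False): every row is dropped.
filtered = [r for r in rows if False]

# Grouping the filtered rows must therefore yield zero groups,
# matching len(groups_native) == 0 above, not 346.
groups = [list(g) for _, g in groupby(sorted(filtered))]
assert len(groups) == 0
```

The bug is that the un-checkpointed pipeline behaves as if `filtered` were still `rows`: the 346 groups correspond to the unfiltered variants of `sample.vcf`.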
chrisvittal (Collaborator, Author) commented:

In many cases, we ignore the dropRows parameter on TableRead. I have no idea how this test was passing on main, but I think I have a fix.
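The shape of the problem described above can be modeled with a small sketch. The class and method names below are hypothetical illustrations, not Hail internals: a read node carries a drop-rows flag, and a read path that ignores the flag lets filtered rows reappear downstream.

```python
# Hypothetical model of the bug (names are illustrative, not Hail's TableRead):
# a table-read node carries a drop_rows flag set by an upstream filter that
# removed all rows. A code path that consults the flag returns no rows; a
# code path that forgets it returns everything the reader can see.

class TableReadModel:
    def __init__(self, rows, drop_rows=False):
        self.rows = rows
        self.drop_rows = drop_rows

    def execute_buggy(self):
        # Ignores drop_rows: always yields the underlying rows,
        # so the upstream filter is silently lost.
        return self.rows

    def execute_fixed(self):
        # Honors drop_rows: yields no rows when the flag is set.
        return [] if self.drop_rows else self.rows

tr = TableReadModel(rows=[1, 2, 3], drop_rows=True)
print(len(tr.execute_buggy()))  # 3: the filter is ignored
print(len(tr.execute_fixed()))  # 0: the expected result
```

This mirrors the transcript: the in-memory pipeline takes the buggy path (346 groups), while the checkpointed table, whose on-disk copy was written with zero rows, gives the correct 0.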
