import_vcf -> group_by ignores filter #11562

Closed
chrisvittal opened this issue Mar 11, 2022 · 1 comment · Fixed by #11563

@chrisvittal
Collaborator

In [1]: import hail as hl

In [2]: hl.init()
2022-03-11 14:49:23 WARN  Utils:69 - Your hostname, metis resolves to a loopback address: 127.0.0.1; using 192.168.1.169 instead (on interface eth0)
2022-03-11 14:49:23 WARN  Utils:69 - Set SPARK_LOCAL_IP if you need to bind to another address
2022-03-11 14:49:23 WARN  NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Running on Apache Spark version 3.1.2
SparkUI available at http://192.168.1.169:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.90-92e40ce648a8
LOGGING: writing to /home/cdv/src/hail/hail/hail-20220311-1449-0.2.90-92e40ce648a8.log

In [3]: mt = hl.import_vcf('src/test/resources/sample.vcf').filter_rows(False)

In [4]: ht = mt._localize_entries('entries', 'columns')

In [5]: groups = ht.group_by(the_key=ht.key).aggregate(value=hl.agg.collect(ht.row_value)).collect()
2022-03-11 14:50:08 Hail: INFO: Coerced sorted dataset
2022-03-11 14:50:10 Hail: INFO: Ordering unsorted dataset with network shuffle

In [6]: len(groups)
Out[6]: 346

In [7]: mt = mt.checkpoint('~/tmp/hail/sample.vcf.filtered.mt')
2022-03-11 14:51:14 Hail: INFO: wrote matrix table with 0 rows and 100 columns in 0 partitions to ~/tmp/hail/sample.vcf.filtered.mt

In [8]: ht = mt._localize_entries('entries', 'columns')

In [9]: groups_native = ht.group_by(the_key=ht.key).aggregate(value=hl.agg.collect(ht.row_value)).collect()

In [10]: len(groups_native)
Out[10]: 0
@chrisvittal
Collaborator Author

It looks like we ignore the dropRows parameter on TableRead in many cases. I have no idea how this test is passing on main, but I think I have a fix.
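
For anyone who wants to check this outside an interactive session, the steps from the report can be condensed into a small script. This is only a sketch: the function names (`count_groups`, `describe`, `reproduce`) are hypothetical helpers rather than Hail APIs, the paths are supplied by the caller, and `overwrite=True` is added so the check can be rerun. The Hail calls themselves are the ones used in the session above.

```python
def count_groups(mt):
    """Localize entries, group by the row key, and return the number of groups."""
    import hail as hl  # imported lazily so describe() works without Hail installed

    ht = mt._localize_entries('entries', 'columns')
    groups = ht.group_by(the_key=ht.key).aggregate(
        value=hl.agg.collect(ht.row_value)
    ).collect()
    return len(groups)


def describe(count_before, count_after):
    """Summarize whether the group counts before and after the checkpoint agree."""
    if count_before == count_after:
        return 'ok'
    return (f'mismatch: {count_before} groups before checkpoint, '
            f'{count_after} after')


def reproduce(vcf_path, checkpoint_path):
    """Run the same pipeline on the imported VCF and on its native checkpoint."""
    import hail as hl

    # filter_rows(False) should remove every row, so both counts should be 0.
    mt = hl.import_vcf(vcf_path).filter_rows(False)
    before = count_groups(mt)

    # Writing and reading back in native format gives the reference result.
    mt = mt.checkpoint(checkpoint_path, overwrite=True)
    after = count_groups(mt)

    return describe(before, after)
```

With the counts from the session above, `describe(346, 0)` reports the mismatch; once the filter is respected on the VCF path, `reproduce('src/test/resources/sample.vcf', ...)` should return `'ok'`.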
