[query] SemanticHash does not respect glob patterns #13915

Closed
danking opened this issue Oct 26, 2023 · 0 comments · Fixed by #13919
danking commented Oct 26, 2023

What happened?

SemanticHash assumes that params.files is a list of concrete file paths, but it is actually a list of paths that may contain glob expressions. Consider the following example. As part of this ticket, we must also determine why this was not caught by test_glob.

(base) dking@wm28c-761 hail % gsutil cp ./src/test/resources/ldprune2.vcf gs://danking/chr1.vcf
Copying file://./src/test/resources/ldprune2.vcf [Content-Type=text/x-vcard]...
/ [1 files][ 11.5 KiB/ 11.5 KiB]                                                
Operation completed over 1 objects/11.5 KiB.                                     
(base) dking@wm28c-761 hail % gsutil cp ./src/test/resources/ldprune2.vcf gs://danking/chr2.vcf
Copying file://./src/test/resources/ldprune2.vcf [Content-Type=text/x-vcard]...
/ [1 files][ 11.5 KiB/ 11.5 KiB]                                                
Operation completed over 1 objects/11.5 KiB.                                     
(base) dking@wm28c-761 hail % ipython                                                          
Python 3.10.9 (main, Jan 11 2023, 09:18:18) [Clang 14.0.6 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.16.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import hail as hl
   ...: hl.import_vcf('gs://danking/chr*.vcf').count()
Initializing Hail with default parameters...
/Users/dking/miniconda3/lib/python3.10/site-packages/hailtop/aiocloud/aiogoogle/user_config.py:29: UserWarning: You have specified the GCS requester pays configuration in both your spark-defaults.conf (/Users/dking/miniconda3/lib/python3.10/site-packages/pyspark/conf/spark-defaults.conf) and either an explicit argument or through `hailctl config`. For GCS requester pays configuration, Hail first checks explicit arguments, then `hailctl config`, then spark-defaults.conf.
  warnings.warn(
SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.
SLF4J: Class path contains SLF4J bindings targeting slf4j-api versions 1.7.x or earlier.
SLF4J: Ignoring binding found at [jar:file:/Users/dking/miniconda3/lib/python3.10/site-packages/pyspark/jars/log4j-slf4j-impl-2.17.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See https://www.slf4j.org/codes.html#ignoredBindings for an explanation.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Running on Apache Spark version 3.3.3
SparkUI available at http://192.168.1.142:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.125-c4e2880b3279
LOGGING: writing to /Users/dking/projects/hail/hail/hail-20231026-0957-0.2.125-c4e2880b3279.log
--------------------------------------------------------------------------- / 1]
FatalError                                Traceback (most recent call last)
Cell In[1], line 2
      1 import hail as hl
----> 2 hl.import_vcf('gs://danking/chr*.vcf').count()

File ~/miniconda3/lib/python3.10/site-packages/hail/matrixtable.py:2631, in MatrixTable.count(self)
   2618 """Count the number of rows and columns in the matrix.
   2619 
   2620 Examples
   (...)
   2628     Number of rows, number of cols.
   2629 """
   2630 count_ir = ir.MatrixCount(self._mir)
-> 2631 return Env.backend().execute(count_ir)

File ~/miniconda3/lib/python3.10/site-packages/hail/backend/backend.py:180, in Backend.execute(self, ir, timed)
    178     result, timings = self._rpc(ActionTag.EXECUTE, payload)
    179 except FatalError as e:
--> 180     raise e.maybe_user_error(ir) from None
    181 if ir.typ == tvoid:
    182     value = None

File ~/miniconda3/lib/python3.10/site-packages/hail/backend/backend.py:178, in Backend.execute(self, ir, timed)
    176 payload = ExecutePayload(self._render_ir(ir), '{"name":"StreamBufferSpec"}', timed)
    177 try:
--> 178     result, timings = self._rpc(ActionTag.EXECUTE, payload)
    179 except FatalError as e:
    180     raise e.maybe_user_error(ir) from None

File ~/miniconda3/lib/python3.10/site-packages/hail/backend/py4j_backend.py:214, in Py4JBackend._rpc(self, action, payload)
    212 if resp.status_code >= 400:
    213     error_json = orjson.loads(resp.content)
--> 214     raise fatal_error_from_java_error_triplet(error_json['short'], error_json['expanded'], error_json['error_id'])
    215 return resp.content, resp.headers.get('X-Hail-Timings', '')

FatalError: FileNotFoundException: File not found: gs://danking/chr*.vcf

Java stack trace:
java.io.FileNotFoundException: File not found: gs://danking/chr*.vcf
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getFileStatus(GoogleHadoopFileSystemBase.java:984)
	at is.hail.io.fs.HadoopFS.fileListEntry(HadoopFS.scala:175)
	at is.hail.io.fs.HadoopFS.fileListEntry(HadoopFS.scala:87)
	at is.hail.io.fs.FS.fileListEntry(FS.scala:417)
	at is.hail.io.fs.FS.fileListEntry$(FS.scala:417)
	at is.hail.io.fs.HadoopFS.fileListEntry(HadoopFS.scala:87)
	at is.hail.expr.ir.analyses.SemanticHash$.getFileHash(SemanticHash.scala:373)
	at is.hail.expr.ir.analyses.SemanticHash$.$anonfun$encode$18(SemanticHash.scala:198)
	at scala.collection.immutable.List.foreach(List.scala:431)
	at is.hail.expr.ir.analyses.SemanticHash$.encode(SemanticHash.scala:198)
	at is.hail.expr.ir.analyses.SemanticHash$.$anonfun$apply$6(SemanticHash.scala:42)
	at is.hail.expr.ir.analyses.SemanticHash$.$anonfun$apply$6$adapted(SemanticHash.scala:41)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at is.hail.expr.ir.analyses.SemanticHash$.go$1(SemanticHash.scala:41)
	at is.hail.expr.ir.analyses.SemanticHash$.$anonfun$apply$4(SemanticHash.scala:54)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.analyses.SemanticHash$.$anonfun$apply$1(SemanticHash.scala:34)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.analyses.SemanticHash$.apply(SemanticHash.scala:26)
	at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:509)
	at is.hail.backend.spark.SparkBackend.$anonfun$execute$4(SparkBackend.scala:546)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.backend.spark.SparkBackend.$anonfun$execute$3(SparkBackend.scala:542)
	at is.hail.backend.spark.SparkBackend.$anonfun$execute$3$adapted(SparkBackend.scala:541)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:76)
	at is.hail.utils.package$.using(package.scala:657)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:76)
	at is.hail.utils.package$.using(package.scala:657)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:62)
	at is.hail.backend.spark.SparkBackend.$anonfun$withExecuteContext$3(SparkBackend.scala:368)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:59)
	at is.hail.backend.spark.SparkBackend.$anonfun$withExecuteContext$2(SparkBackend.scala:364)
	at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:541)
	at is.hail.backend.BackendHttpHandler.handle(BackendServer.scala:51)
	at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)
	at sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:83)
	at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:82)
	at sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:822)
	at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)
	at sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:794)
	at sun.net.httpserver.ServerImpl$DefaultExecutor.execute(ServerImpl.java:199)
	at sun.net.httpserver.ServerImpl$Dispatcher.handle(ServerImpl.java:544)
	at sun.net.httpserver.ServerImpl$Dispatcher.run(ServerImpl.java:509)
	at java.lang.Thread.run(Thread.java:750)



Hail version: 0.2.125-c4e2880b3279
Error summary: FileNotFoundException: File not found: gs://danking/chr*.vcf

Version

0.2.124

Relevant log output

No response

ehigham self-assigned this Oct 26, 2023
ehigham added a commit to ehigham/hail that referenced this issue Oct 26, 2023
Fixes hail-is#13915
`MatrixVCFReader` accepts glob patterns (wildcards in glob names). This
bamboozled `SemanticHash` which had assumed all files had been resolved.
This change fixes this by adding explicit `FileNotFoundException`
handling to `SemanticHash` and replacing the `params.files` object of
`MatrixVCFReader` with the resolved paths.
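
The idea in the commit message above (resolve glob patterns to concrete paths first, then hash, and handle a missing file explicitly) can be illustrated with a minimal Python sketch against the local filesystem. This is not Hail's implementation: `resolve_paths`, `file_fingerprint`, and `semantic_hash` are hypothetical names, the path/size/mtime fingerprint is an assumed stand-in for whatever the real per-file hash uses, and returning `None` when a file is missing or nothing matches is an assumption about what "explicit handling" could look like.

```python
import glob
import hashlib
import os
import tempfile
from typing import List, Optional


def resolve_paths(patterns: List[str]) -> List[str]:
    """Expand each glob pattern into the sorted concrete paths it matches."""
    resolved: List[str] = []
    for pattern in patterns:
        resolved.extend(sorted(glob.glob(pattern)))
    return resolved


def file_fingerprint(path: str) -> bytes:
    """Hypothetical per-file fingerprint built from path, size and mtime.

    os.stat raises FileNotFoundError when `path` is an unexpanded glob
    such as 'chr*.vcf' and no file literally has that name.
    """
    st = os.stat(path)
    return f"{path}:{st.st_size}:{st.st_mtime_ns}".encode()


def semantic_hash(patterns: List[str]) -> Optional[str]:
    """Hash the resolved files; return None if no hash can be computed."""
    paths = resolve_paths(patterns)
    if not paths:
        return None  # assumption: nothing matched, so there is nothing to hash
    digest = hashlib.sha256()
    try:
        for path in paths:
            digest.update(file_fingerprint(path))
    except FileNotFoundError:
        return None  # assumption: a file vanished between listing and stat
    return digest.hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        for name in ("chr1.vcf", "chr2.vcf"):
            with open(os.path.join(d, name), "w") as f:
                f.write("##fileformat=VCFv4.2\n")
        pattern = os.path.join(d, "chr*.vcf")
        print(resolve_paths([pattern]))
        print(semantic_hash([pattern]) is not None)
```

Calling `file_fingerprint` directly on the unexpanded pattern reproduces the shape of the failure reported in this issue: the pattern is treated as a literal file name and the lookup fails.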
danking pushed a commit that referenced this issue Oct 26, 2023
Fixes #13915