[SW-539] Fix bug when pysparkling is executed in parallel on the same node #393
Conversation
py/pysparkling/initializer.py
Outdated
    return sw_jar
cache_path = get_cache_path(zip_filename)
cached_jar = os.path.abspath("{}/sparkling_water/sparkling_water_assembly.jar".format(cache_path))
if os.path.exists(cached_jar) and os.path.isfile(cached_jar):
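The lookup in the diff above can be read as a small standalone helper: build the expected cache path and return it only if the JAR is actually there. This is a minimal sketch; the function name `find_cached_jar` is hypothetical, only the directory layout and file name come from the diff.

```python
import os


def find_cached_jar(cache_path):
    """Return the cached assembly JAR path if it exists, else None.

    Sketch of the check from the diff above; the helper name is an
    assumption, not part of the pysparkling API.
    """
    cached_jar = os.path.abspath(
        "{}/sparkling_water/sparkling_water_assembly.jar".format(cache_path))
    # Guard against the path existing but being a directory
    if os.path.exists(cached_jar) and os.path.isfile(cached_jar):
        return cached_jar
    return None
```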
Thinking about it more, can we disable caching entirely? It could be dangerous if we have a cached old version of the JAR and an upgraded Sparkling Water.
I'm all for not using the cache, as the cache was also a cause of problems before. We can disable it in this case since we need to extract the JAR anyway, but we can extract it to a temporary directory which will be cleaned up at the end of H2OContext.
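The suggestion above, extract into a fresh temporary directory and clean it up when the context ends, could look roughly like this. The `extract` callable stands in for the real unzip step and is an assumption; in Sparkling Water the cleanup would be tied to H2OContext shutdown rather than `atexit`.

```python
import atexit
import shutil
import tempfile


def extract_jar_to_temp(extract):
    """Extract the assembly JAR into a fresh temp dir and register cleanup.

    `extract(tmp_dir)` is a hypothetical callable that writes the JAR
    into tmp_dir and returns its path.
    """
    tmp_dir = tempfile.mkdtemp(prefix="sparkling_water_")
    jar_path = extract(tmp_dir)
    # Remove the directory at process exit; ignore_errors keeps this
    # safe if the directory was already removed.
    atexit.register(shutil.rmtree, tmp_dir, ignore_errors=True)
    return jar_path
```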
Force-pushed f8d295c to 8669a69
Force-pushed 8669a69 to 3f0df23
@@ -139,12 +139,14 @@ def getOrCreate(spark, conf=None, **kwargs):

    def stop_with_jvm(self):
        Initializer.clean_temp_dir()
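For `clean_temp_dir` to be callable from a shutdown hook like `stop_with_jvm`, it helps if it is idempotent and tolerant of partial shutdown. A sketch under that assumption; everything beyond the `Initializer.clean_temp_dir` name in the diff is hypothetical.

```python
import shutil


class Initializer(object):
    """Sketch of the cleanup hook referenced in the diff above."""

    # Set when the JAR/egg was extracted; None means nothing to clean.
    _temp_dir = None

    @staticmethod
    def clean_temp_dir():
        # ignore_errors makes the call safe even if shutdown already
        # removed the directory, so repeated invocations are harmless.
        if Initializer._temp_dir is not None:
            shutil.rmtree(Initializer._temp_dir, ignore_errors=True)
            Initializer._temp_dir = None
```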
Update: this seems to cause stack traces to be printed during Spark shutdown. We cannot expect that it will be fully executed. The better solution is to simply skip the cleanup.
Good point! I'll work on this tomorrow.
This bug fix reintroduces the cache which @mmalohlava wrote a while ago. However, it also changes the Python egg cache path for tests, so the tests always use the correct latest artifact (this was a source of issues and also the reason why the cache was removed).
Also, when testing SNAPSHOT versions locally against different Sparkling Water builds, we make sure to use a temporary cache for Python eggs, again to be sure we run on the latest code.
The cache is fine if it's used by users on released versions.
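The policy described above, a stable cache for released versions but a throwaway location for SNAPSHOT builds, could be sketched as below. All names here are hypothetical illustrations, not the actual pysparkling layout.

```python
import os
import tempfile


def choose_cache_dir(base_cache, version):
    """Pick a cache directory based on the Sparkling Water version.

    Released versions get a stable, version-keyed directory (so an
    upgrade never reuses a stale artifact), while SNAPSHOT builds get
    a fresh temporary directory so tests always run the latest code.
    """
    if "SNAPSHOT" in version:
        # Never reuse cached eggs/JARs for snapshot builds.
        return tempfile.mkdtemp(prefix="sw_snapshot_")
    return os.path.join(base_cache, "sparkling_water-{}".format(version))
```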