
As a user I want to run multiple notebooks locally with Spark #496

Closed
MargrietGroenendijk opened this issue Nov 3, 2017 · 1 comment

MargrietGroenendijk (Contributor)

Expected behavior

Run the code below from multiple notebooks locally, each using the same newly installed kernel (Python 3 with Pixiedust (Spark 2.2)):

home_df = pixiedust.sampleData(6)

Actual behavior

Running the code in one notebook works fine, but running the same code from a second notebook gives an error.

Steps to reproduce the behavior

Install Python 3.6:
$ conda create -n py36 python=3.6 anaconda
$ source activate py36

Add a new kernel with Spark:
$ jupyter pixiedust install

Create 2 notebooks with the new kernel: Python 3 with Pixiedust (Spark 2.2)

Run in notebook 1:

import pixiedust
home_df = pixiedust.sampleData(6)

Run in notebook 2:

import pixiedust
home_df = pixiedust.sampleData(6)

This works in notebook 1, but notebook 2 gives the following error:

Downloading 'Million dollar home sales in NE Mass late 2016' from https://openobjectstore.mybluemix.net/misc/milliondollarhomes.csv
Downloaded 102051 bytes
Creating pySpark DataFrame for 'Million dollar home sales in NE Mass late 2016'. Please wait...
Successfully created pySpark DataFrame for 'Million dollar home sales in NE Mass late 2016'
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-287cb731db50> in <module>()
----> 1 home_df = pixiedust.sampleData(6)

//anaconda/envs/py36/lib/python3.6/site-packages/pixiedust/utils/environment.py in wrapper(*args, **kwargs)
     96             kwargs.pop("fromScala")
     97             fromScala = True
---> 98         retValue = func(*args, **kwargs)
     99         if fromScala and retValue is not None:
    100             from pixiedust.utils.javaBridge import JavaWrapper

//anaconda/envs/py36/lib/python3.6/site-packages/pixiedust/utils/sampleData.py in sampleData(dataId, type, forcePandas)
     82 def sampleData(dataId=None, type='csv', forcePandas=False):
     83     global dataDefs
---> 84     return SampleData(dataDefs, forcePandas).sampleData(dataId, type)
     85 
     86 class SampleData(object):

//anaconda/envs/py36/lib/python3.6/site-packages/pixiedust/utils/sampleData.py in sampleData(self, dataId, type)
     95             self.printSampleDataList()
     96         elif str(dataId) in dataDefs:
---> 97             return self.loadSparkDataFrameFromSampleData(dataDefs[str(dataId)])
     98         elif "https://" in str(dataId) or "http://" in str(dataId) or "file://" in str(dataId):
     99             if type is 'json':

//anaconda/envs/py36/lib/python3.6/site-packages/pixiedust/utils/sampleData.py in loadSparkDataFrameFromSampleData(self, dataDef)
    171 
    172     def loadSparkDataFrameFromSampleData(self, dataDef):
--> 173         return Downloader(dataDef, self.forcePandas).download(self.dataLoader)
    174 
    175     def loadSparkDataFrameFromUrl(self, dataUrl):

//anaconda/envs/py36/lib/python3.6/site-packages/pixiedust/utils/sampleData.py in download(self, dataLoader)
    238                 print("Creating {1} DataFrame for '{0}'. Please wait...".\
    239                     format(displayName, 'pySpark' if Environment.hasSpark and not self.forcePandas else 'pandas'))
--> 240                 return dataLoader(path, self.dataDef.get("schema", None))
    241             finally:
    242                 print("Successfully created {1} DataFrame for '{0}'".\

//anaconda/envs/py36/lib/python3.6/site-packages/pixiedust/utils/sampleData.py in dataLoader(self, path, schema)
    121 
    122         if Environment.hasSpark and not self.forcePandas:
--> 123             if Environment.sparkVersion == 1:
    124                 print("Loading file using 'com.databricks.spark.csv'")
    125                 load = ShellAccess.sqlContext.read.format('com.databricks.spark.csv')

//anaconda/envs/py36/lib/python3.6/site-packages/pixiedust/utils/environment.py in <lambda>(cls, key)
     23 class Environment(with_metaclass( 
     24         type("",(type,),{
---> 25             "__getattr__":lambda cls, key: getattr(cls.env, key)
     26         }), object
     27     )):

//anaconda/envs/py36/lib/python3.6/site-packages/pixiedust/utils/__init__.py in inner(cls, *args, **kwargs)
     88             if hasattr(cls, fieldName) and getattr(cls, fieldName) is not None:
     89                 return getattr(cls, fieldName)
---> 90             retValue = func(cls, *args, **kwargs)
     91             setattr(cls, fieldName, retValue)
     92             return retValue

//anaconda/envs/py36/lib/python3.6/site-packages/pixiedust/utils/environment.py in sparkVersion(self)
     78             if not self.hasSpark:
     79                 return None
---> 80             version = ShellAccess["sc"].version
     81             if version.startswith('1.'):
     82                 return 1

AttributeError: 'NoneType' object has no attribute 'version'
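
A possible workaround while a fix is pending (not confirmed by the maintainers) is to bypass the Spark code path entirely with the forcePandas flag visible in sampleData's signature in the traceback above:

import pixiedust
# forcePandas=True skips the Environment.sparkVersion check that fails when the
# second kernel has no SparkContext, and returns a pandas DataFrame instead.
home_df = pixiedust.sampleData(6, forcePandas=True)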
DTAIEB (Member) commented Nov 3, 2017

Unfortunately, the error is coming from the Spark lower layers: the Derby metastore database cannot be opened, probably because multiple processes are trying to access it at the same time, which leaves the second kernel without a SparkContext (hence the NoneType error above). The best we can do is to harden the PixieDust code to fall back on pandas when this happens.
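
A minimal sketch of what that hardening could look like, reusing names visible in the traceback (Environment, ShellAccess, forcePandas); this is only an illustration, not the actual commit that closed the issue:

def dataLoader(self, path, schema=None):
    # Sketch only: prefer Spark when it is available and not explicitly disabled.
    use_spark = Environment.hasSpark and not self.forcePandas
    if use_spark and ShellAccess["sc"] is None:
        # Spark is installed but this kernel has no SparkContext (for example,
        # because another notebook holds the Derby metastore lock): fall back.
        print("SparkContext unavailable, falling back to pandas")
        use_spark = False
    if use_spark:
        return ShellAccess.sqlContext.read.csv(path, header=True, inferSchema=True)
    import pandas as pd
    return pd.read_csv(path)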

DTAIEB pushed a commit that referenced this issue Nov 3, 2017
#496 As a user I want to run multiple notebooks locally with Spark
DTAIEB closed this as completed Nov 10, 2017