
As a user I want to run multiple notebooks locally with Spark #496

Closed
MargrietGroenendijk opened this issue Nov 3, 2017 · 1 comment

MargrietGroenendijk (Contributor)

Expected behavior

Run the code below from multiple notebooks locally, each using the same newly installed kernel (Python 3 with Pixiedust (Spark 2.2)):

home_df = pixiedust.sampleData(6)

Actual behavior

Running the code in one notebook works fine, but running the same code from a second notebook gives an error.

Steps to reproduce the behavior

Install Python 3.6:
$ conda create -n py36 python=3.6 anaconda
$ source activate py36

Add a new kernel with Spark:
$ jupyter pixiedust install

Create 2 notebooks with the new kernel: Python 3 with Pixiedust (Spark 2.2)

Run in notebook 1:

import pixiedust
home_df = pixiedust.sampleData(6)

Run in notebook 2:

import pixiedust
home_df = pixiedust.sampleData(6)

This works in notebook 1, but notebook 2 gives the following error:

Downloading 'Million dollar home sales in NE Mass late 2016' from https://openobjectstore.mybluemix.net/misc/milliondollarhomes.csv
Downloaded 102051 bytes
Creating pySpark DataFrame for 'Million dollar home sales in NE Mass late 2016'. Please wait...
Successfully created pySpark DataFrame for 'Million dollar home sales in NE Mass late 2016'
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-287cb731db50> in <module>()
----> 1 home_df = pixiedust.sampleData(6)

//anaconda/envs/py36/lib/python3.6/site-packages/pixiedust/utils/environment.py in wrapper(*args, **kwargs)
     96             kwargs.pop("fromScala")
     97             fromScala = True
---> 98         retValue = func(*args, **kwargs)
     99         if fromScala and retValue is not None:
    100             from pixiedust.utils.javaBridge import JavaWrapper

//anaconda/envs/py36/lib/python3.6/site-packages/pixiedust/utils/sampleData.py in sampleData(dataId, type, forcePandas)
     82 def sampleData(dataId=None, type='csv', forcePandas=False):
     83     global dataDefs
---> 84     return SampleData(dataDefs, forcePandas).sampleData(dataId, type)
     85 
     86 class SampleData(object):

//anaconda/envs/py36/lib/python3.6/site-packages/pixiedust/utils/sampleData.py in sampleData(self, dataId, type)
     95             self.printSampleDataList()
     96         elif str(dataId) in dataDefs:
---> 97             return self.loadSparkDataFrameFromSampleData(dataDefs[str(dataId)])
     98         elif "https://" in str(dataId) or "http://" in str(dataId) or "file://" in str(dataId):
     99             if type is 'json':

//anaconda/envs/py36/lib/python3.6/site-packages/pixiedust/utils/sampleData.py in loadSparkDataFrameFromSampleData(self, dataDef)
    171 
    172     def loadSparkDataFrameFromSampleData(self, dataDef):
--> 173         return Downloader(dataDef, self.forcePandas).download(self.dataLoader)
    174 
    175     def loadSparkDataFrameFromUrl(self, dataUrl):

//anaconda/envs/py36/lib/python3.6/site-packages/pixiedust/utils/sampleData.py in download(self, dataLoader)
    238                 print("Creating {1} DataFrame for '{0}'. Please wait...".\
    239                     format(displayName, 'pySpark' if Environment.hasSpark and not self.forcePandas else 'pandas'))
--> 240                 return dataLoader(path, self.dataDef.get("schema", None))
    241             finally:
    242                 print("Successfully created {1} DataFrame for '{0}'".\

//anaconda/envs/py36/lib/python3.6/site-packages/pixiedust/utils/sampleData.py in dataLoader(self, path, schema)
    121 
    122         if Environment.hasSpark and not self.forcePandas:
--> 123             if Environment.sparkVersion == 1:
    124                 print("Loading file using 'com.databricks.spark.csv'")
    125                 load = ShellAccess.sqlContext.read.format('com.databricks.spark.csv')

//anaconda/envs/py36/lib/python3.6/site-packages/pixiedust/utils/environment.py in <lambda>(cls, key)
     23 class Environment(with_metaclass( 
     24         type("",(type,),{
---> 25             "__getattr__":lambda cls, key: getattr(cls.env, key)
     26         }), object
     27     )):

//anaconda/envs/py36/lib/python3.6/site-packages/pixiedust/utils/__init__.py in inner(cls, *args, **kwargs)
     88             if hasattr(cls, fieldName) and getattr(cls, fieldName) is not None:
     89                 return getattr(cls, fieldName)
---> 90             retValue = func(cls, *args, **kwargs)
     91             setattr(cls, fieldName, retValue)
     92             return retValue

//anaconda/envs/py36/lib/python3.6/site-packages/pixiedust/utils/environment.py in sparkVersion(self)
     78             if not self.hasSpark:
     79                 return None
---> 80             version = ShellAccess["sc"].version
     81             if version.startswith('1.'):
     82                 return 1

AttributeError: 'NoneType' object has no attribute 'version'
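
A possible workaround while a fix is pending (not confirmed by the maintainers) is to bypass the Spark code path entirely with the forcePandas flag visible in sampleData's signature in the traceback above:

import pixiedust
# forcePandas=True skips the Environment.sparkVersion check that fails when the
# second kernel has no SparkContext, and returns a pandas DataFrame instead.
home_df = pixiedust.sampleData(6, forcePandas=True)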
DTAIEB (Member) commented Nov 3, 2017

Unfortunately, the error is coming from the Spark lower layers: the Derby metastore database cannot be opened, probably because multiple processes are trying to access it at the same time, which leaves the second kernel without a SparkContext (hence the NoneType error above). The best we can do is to harden the PixieDust code to fall back on pandas when this happens.
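
A minimal sketch of what that hardening could look like, reusing names visible in the traceback (Environment, ShellAccess, forcePandas); this is only an illustration, not the actual commit that closed the issue:

def dataLoader(self, path, schema=None):
    # Sketch only: prefer Spark when it is available and not explicitly disabled.
    use_spark = Environment.hasSpark and not self.forcePandas
    if use_spark and ShellAccess["sc"] is None:
        # Spark is installed but this kernel has no SparkContext (for example,
        # because another notebook holds the Derby metastore lock): fall back.
        print("SparkContext unavailable, falling back to pandas")
        use_spark = False
    if use_spark:
        return ShellAccess.sqlContext.read.csv(path, header=True, inferSchema=True)
    import pandas as pd
    return pd.read_csv(path)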

DTAIEB pushed a commit that referenced this issue Nov 3, 2017
#496 As a user I want to run multiple notebooks locally with Spark
DTAIEB closed this as completed Nov 10, 2017