SparkDataSet with a relative local file path doesn't work on jupyter notebook/lab #47
Comments
Hi @gotin, thank you for submitting this issue. The reason for this behaviour is that the Jupyter Notebook/Lab server sets the current working directory to the directory where the notebook is saved, therefore relative paths in the catalog config point to non-existing locations. Here are several options to mitigate this:

Option 1: You can run `kedro jupyter notebook` (or `kedro jupyter lab`) from the root directory of your project.

Option 2: Alternatively, if you require a vanilla Jupyter Notebook/Lab server, you can have a look at

Option 3: You can also fix this manually by changing the current working directory to the root of your project at the top of your notebook.
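A minimal sketch of Option 3, meant for the first cell of the notebook. It assumes the common layout where the notebook server starts inside a `notebooks/` folder directly under the project root; the helper name is hypothetical, not part of kedro:

```python
import os
from pathlib import Path

def cd_to_project_root(cwd: Path = None) -> Path:
    """If the current directory is a `notebooks/` folder, chdir to its
    parent (assumed to be the project root) so that relative paths in
    catalog.yml resolve correctly; otherwise stay where we are."""
    cwd = cwd or Path.cwd()
    root = cwd.parent if cwd.name == "notebooks" else cwd
    os.chdir(root)
    return root

# First cell of the notebook:
cd_to_project_root()
```

Note this must run before any SparkSession is created, since Spark captures the working directory when it starts.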
@DmitriiDeriabinQB, thanks for replying! As for Option 1, I actually ran `kedro jupyter (notebook|lab)` at the root directory of my project, but the result was what I wrote above. I didn't run `kedro jupyter (notebook|lab)` from the 'notebook' directory; it was the root directory of the project. So Option 1 and Option 3 didn't work. Option 2 isn't what I require so far.
By the way, this happened only when I used SparkDataSet with a relative local filepath. pandas-type DataSets didn't have this issue even when a relative local filepath was given.
So far I wasn't able to reproduce this issue, since

Please note that I've changed the folder from

This is somehow related to the dataset path definition for

Could you then paste the output here, please?
The output was as follows:
I'm going to try to find a minimal way to reproduce this issue.
I just realized that this issue doesn't happen in the other projects I made from scratch. Maybe something wrong is hiding in the project where this issue happens, which I haven't found yet.
@gotin, thank you for looking into it. Please keep us updated if you manage to reproduce the issue. |
I just found the cause of this issue. In a Python script (xxx.py) under [project dir]/src/[project name]/nodes/, there was a SparkSession initialization fragment which looks like the following:

```python
# in xxx.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
```

This was the cause. In [project dir]/src/[project name]/run.py, the SparkSession initialization code was placed as below, so the `kedro run` command execution didn't suffer from this issue:

```python
# in run.py
from pathlib import Path
from typing import Iterable

from pyspark.sql import SparkSession


def init_spark_session(aws_access_key=None, aws_secret_key=None):
    spark = (SparkSession.builder.master("local[*]")
             .appName("kedro")
             .config("spark.executor.memory", "24G")
             .config("spark.executor.cores", "10")
             .config("spark.driver.memory", "4G")
             .config("spark.sql.execution.arrow.enabled", "true")
             .config("spark.driver.maxResultSize", "3g")
             .getOrCreate())
    return spark


def main(
    tags: Iterable[str] = None,
    env: str = None,
    runner: str = None,
):
    # Load Catalog
    # (get_config, create_catalog and create_pipeline are defined
    # elsewhere in the project)
    conf = get_config(project_path=str(Path.cwd()), env=env)
    catalog = create_catalog(config=conf)
    spark = init_spark_session()

    # Load the pipeline
    pipeline = create_pipeline()
    pipeline = pipeline.only_nodes_with_tags(*tags) if tags else pipeline
```

But for the notebook, the Python script defining the node functions, which contained the SparkSession initialization shown above, was loaded during the notebook initialization process, so the SparkSession was initialized with the notebook's directory as the working directory. To avoid this issue, I moved the SparkSession initialization into the function bodies of the node-defining script, like the following:

```python
from .. import run


def func1(df, params):
    spark = run.init_spark_session()
    # some code using the spark session
```

This solved the issue I had been facing. I hope this comment helps someone facing the same issue in the future.
Description
SparkDataSet with a relative local file path doesn't work on jupyter notebook/lab
Context
We can have a SparkDataSet entry in catalog.yml whose filepath is a relative local file path, which looks like this:
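The original YAML was not captured above; a hypothetical entry of this shape (dataset name and path taken from the surrounding text, the `type` string assumed for kedro 0.14-era contrib datasets) might look like:

```yaml
something:
  type: kedro.contrib.io.pyspark.SparkDataSet  # type path assumed; check your kedro version
  filepath: data/01_intermediate/something.parquet
  file_format: parquet
```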
And when something.parquet is placed under <project_directory>/data/01_intermediate/something.parquet properly, `kedro run` can successfully load this parquet as long as the pipeline uses the 'something' dataset.
But on a Jupyter notebook invoked by the `kedro jupyter notebook` command, the following script doesn't load 'something' as expected.
Instead, it raises an exception that looks like the following:
Reading spark_data_set.py (of kedro) and readwriter.py (of pyspark), I think this is caused by the spark.read.load implementation. Apparently spark.read.load tries to read the data located at /Users/go_kojima/sample_kedro_project/notebooks/data/01_intermediate/something.parquet: somehow, spark.read.load resolves a given relative filepath against the directory of the notebook.
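The mechanism can be illustrated without Spark at all: the JVM behind a SparkSession is a separate process that captures the launcher's working directory once, at start-up, and later `os.chdir` calls in Python have no effect on it. A stdlib-only sketch of the same effect, with a plain child process standing in for the Spark JVM:

```python
import os
import subprocess
import sys
import tempfile

start_dir = tempfile.mkdtemp()   # where the "JVM" is launched from
other_dir = tempfile.mkdtemp()   # where the parent moves to afterwards
os.chdir(start_dir)

# Stand-in for the JVM: a long-lived child that reports its cwd on request.
child = subprocess.Popen(
    [sys.executable, "-c", "import os, sys; sys.stdin.readline(); print(os.getcwd())"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)
os.chdir(other_dir)  # the parent changes directory while the child is running
out, _ = child.communicate("go\n")
child_cwd = out.strip()
# The child still resolves paths against the directory it was launched in,
# not the parent's new cwd -- exactly how spark.read.load ends up looking
# under the notebook's directory.
print(child_cwd)
```

This matches the diagnosis in the comments below: the SparkSession was created while the working directory was the notebooks folder, so every relative path handed to `spark.read.load` is resolved from there.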
Steps to Reproduce
As I showed above,
Expected Result
load a 'something' dataframe
Actual Result
raises exceptions
Your Environment
Include as many relevant details about the environment in which you experienced the bug:
- Kedro version used (`pip show kedro` or `kedro -V`): kedro, version 0.14.1 (anaconda3-2019.03)
- Python version used (`python -V`): Python 3.7.3 (anaconda3-2019.03)
- Operating system and version: macOS Mojave version 10.14.5
My personal solution
Modify the filepath into an absolute file path when a relative local file path is given, in SparkDataSet's init function. The code looks like the following:
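The actual patch was not captured above; a hypothetical sketch of the idea (class and attribute names assumed, not kedro's real implementation) could be:

```python
from pathlib import Path

class SparkDataSet:
    """Simplified stand-in for kedro's SparkDataSet, for illustration only;
    the real class takes more arguments and does more work."""

    def __init__(self, filepath: str, file_format: str = "parquet"):
        # Leave URIs (s3a://, hdfs://, ...) alone; resolve bare relative
        # local paths to absolute ones so that a later change of working
        # directory (e.g. by a notebook server) cannot break loading.
        if "://" not in filepath and not Path(filepath).is_absolute():
            filepath = str(Path(filepath).resolve())
        self._filepath = filepath
        self._file_format = file_format
```

Note the caveat: this only helps if the dataset object is constructed while the working directory is still the project root; resolving the path inside a notebook whose cwd is already the notebooks folder would bake in the wrong location.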