
Add Windows CI to kedro-starters #89

Merged
merged 29 commits into from Jun 7, 2022

Conversation

@AhdraMeraliQB (Contributor) commented May 18, 2022:

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>

Motivation and Context

Kedro-starters was lacking Windows CI. This PR rectifies that by adding Windows builds to the e2e testing. The setup has also been changed to make use of conda environments; the tests themselves have not been changed.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Assigned myself to the PR
  • Added tests to cover my changes

@AhdraMeraliQB AhdraMeraliQB self-assigned this May 18, 2022
@AhdraMeraliQB AhdraMeraliQB linked an issue May 18, 2022 that may be closed by this pull request
Ahdra Merali added 9 commits May 18, 2022 16:36
(Resolved review thread on .circleci/config.yml)
Ahdra Merali and others added 12 commits May 19, 2022 15:20
* Try cloning env in before scenario
* Try again
* Try find line
* Try find line 2
* Try find line 3
* Try without venv.main step
* Try activating env at every windows build step
* Activate env before running tests/linting
* Clean up environment
* Try adding hadoop setup step
* Try adding hadoop setup step
* Clean up
* Make formatting consistent

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Co-authored-by: Merel Theisen <merel.theisen@quantumblack.com>
@AhdraMeraliQB (Contributor Author) commented:

The test for running pyspark-iris fails on the Windows builds; the error messages have been captured and are appended below. However, this should perhaps be addressed in a separate issue, as I would say it falls outside the scope of the current one.

--

2022-05-24 13:52:37,380 - kedro.framework.session.session - INFO - Kedro project project-dummy
Missing Python executable 'python3', defaulting to 'C:\Users\circleci\AppData\Local\Temp\tmpwn80ki8s\Lib\site-packages\pyspark\bin\..' for SPARK_HOME environment variable. Please install Python or specify the correct Python executable in PYSPARK_DRIVER_PYTHON or PYSPARK_PYTHON environment variable to detect SPARK_HOME safely.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/C:/Users/circleci/AppData/Local/Temp/tmpwn80ki8s/Lib/site-packages/pyspark/jars/spark-unsafe_2.12-3.2.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
22/05/24 13:52:40 WARN Shell: Did not find winutils.exe: {}
java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
	at org.apache.hadoop.util.Shell.fileNotFoundException(Shell.java:548)
	at org.apache.hadoop.util.Shell.getHadoopHomeDir(Shell.java:569)
	at org.apache.hadoop.util.Shell.getQualifiedBin(Shell.java:592)
	at org.apache.hadoop.util.Shell.<clinit>(Shell.java:689)
	at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
	at org.apache.hadoop.conf.Configuration.getTimeDurationHelper(Configuration.java:1886)
	at org.apache.hadoop.conf.Configuration.getTimeDuration(Configuration.java:1846)
	at org.apache.hadoop.conf.Configuration.getTimeDuration(Configuration.java:1819)
	at org.apache.hadoop.util.ShutdownHookManager.getShutdownTimeout(ShutdownHookManager.java:183)
	at org.apache.hadoop.util.ShutdownHookManager$HookEntry.<init>(ShutdownHookManager.java:207)
	at org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:304)
	at org.apache.spark.util.SparkShutdownHookManager.install(ShutdownHookManager.scala:181)
	at org.apache.spark.util.ShutdownHookManager$.shutdownHooks$lzycompute(ShutdownHookManager.scala:50)
	at org.apache.spark.util.ShutdownHookManager$.shutdownHooks(ShutdownHookManager.scala:48)
	at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:153)
	at org.apache.spark.util.ShutdownHookManager$.<init>(ShutdownHookManager.scala:58)
	at org.apache.spark.util.ShutdownHookManager$.<clinit>(ShutdownHookManager.scala)
	at org.apache.spark.util.Utils$.createTempDir(Utils.scala:335)
	at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:344)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:898)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
	at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:468)
	at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:439)
	at org.apache.hadoop.util.Shell.<clinit>(Shell.java:516)
	... 22 more
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/05/24 13:52:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/05/24 13:52:43 WARN FairSchedulableBuilder: Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.
2022-05-24 13:52:46,065 - py.warnings - WARNING - C:\Users\circleci\AppData\Local\Temp\tmpwn80ki8s\lib\site-packages\kedro\framework\context\context.py:344: UserWarning: Credentials not found in your Kedro project config.
No files found in ['C:\\Users\\circleci\\AppData\\Local\\Temp\\tmp4z2amcqt\\project-dummy\\conf\\base', 'C:\\Users\\circleci\\AppData\\Local\\Temp\\tmp4z2amcqt\\project-dummy\\conf\\local'] matching the glob pattern(s): ['credentials*', 'credentials*/**', '**/credentials*']
  warn(f"Credentials not found in your Kedro project config.\n{str(exc)}")

2022-05-24 13:52:46,136 - py.warnings - WARNING - C:\Users\circleci\AppData\Local\Temp\tmpwn80ki8s\lib\site-packages\hdfs\config.py:15: DeprecationWarning: the imp module is deprecated in favour of importlib and slated for removal in Python 3.12; see the module's documentation for alternative uses
  from imp import load_source

2022-05-24 13:52:46,277 - kedro.io.data_catalog - INFO - Loading data from `example_iris_data` (SparkDataSet)...
2022-05-24 13:52:52,701 - kedro.io.data_catalog - INFO - Loading data from `parameters` (MemoryDataSet)...
2022-05-24 13:52:52,715 - kedro.pipeline.node - INFO - Running node: split: split_data([example_iris_data,parameters]) -> [X_train@pyspark,X_test@pyspark,y_train@pyspark,y_test@pyspark]
2022-05-24 13:52:52,815 - kedro.io.data_catalog - INFO - Saving data to `X_train@pyspark` (SparkDataSet)...
2022-05-24 13:52:53,107 - kedro.runner.sequential_runner - WARNING - There are 3 nodes that have not run.
You can resume the pipeline run by adding the following argument to your previous command:

Traceback (most recent call last):
  File "C:\Users\circleci\AppData\Local\Temp\tmpwn80ki8s\lib\site-packages\kedro\io\core.py", line 210, in save
    self._save(data)
  File "C:\Users\circleci\AppData\Local\Temp\tmpwn80ki8s\lib\site-packages\kedro\extras\datasets\spark\spark_dataset.py", line 390, in _save
    data.write.save(save_path, self._file_format, **self._save_args)
  File "C:\Users\circleci\AppData\Local\Temp\tmpwn80ki8s\lib\site-packages\pyspark\sql\readwriter.py", line 740, in save
    self._jwrite.save(path)
  File "C:\Users\circleci\AppData\Local\Temp\tmpwn80ki8s\lib\site-packages\py4j\java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "C:\Users\circleci\AppData\Local\Temp\tmpwn80ki8s\lib\site-packages\pyspark\sql\utils.py", line 111, in deco
    return f(*a, **kw)
  File "C:\Users\circleci\AppData\Local\Temp\tmpwn80ki8s\lib\site-packages\py4j\protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o80.save.
: java.lang.RuntimeException: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
	at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:736)
	at org.apache.hadoop.util.Shell.getSetPermissionCommand(Shell.java:271)
	at org.apache.hadoop.util.Shell.getSetPermissionCommand(Shell.java:287)
	at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:978)
	at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:660)
	at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:700)
	at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:672)
	at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:699)
	at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:672)
	at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:699)
	at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:672)
	at org.apache.hadoop.fs.ChecksumFileSystem.mkdirs(ChecksumFileSystem.java:788)
	at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.setupJob(FileOutputCommitter.java:356)
	at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupJob(HadoopMapReduceCommitProtocol.scala:178)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:182)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
	at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:91)
	at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:128)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:848)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:567)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:835)
Caused by: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
	at org.apache.hadoop.util.Shell.fileNotFoundException(Shell.java:548)
	at org.apache.hadoop.util.Shell.getHadoopHomeDir(Shell.java:569)
	at org.apache.hadoop.util.Shell.getQualifiedBin(Shell.java:592)
	at org.apache.hadoop.util.Shell.<clinit>(Shell.java:689)
	at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
	at org.apache.hadoop.conf.Configuration.getTimeDurationHelper(Configuration.java:1886)
	at org.apache.hadoop.conf.Configuration.getTimeDuration(Configuration.java:1846)
	at org.apache.hadoop.conf.Configuration.getTimeDuration(Configuration.java:1819)
	at org.apache.hadoop.util.ShutdownHookManager.getShutdownTimeout(ShutdownHookManager.java:183)
	at org.apache.hadoop.util.ShutdownHookManager$HookEntry.<init>(ShutdownHookManager.java:207)
	at org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:304)
	at org.apache.spark.util.SparkShutdownHookManager.install(ShutdownHookManager.scala:181)
	at org.apache.spark.util.ShutdownHookManager$.shutdownHooks$lzycompute(ShutdownHookManager.scala:50)
	at org.apache.spark.util.ShutdownHookManager$.shutdownHooks(ShutdownHookManager.scala:48)
	at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:153)
	at org.apache.spark.util.ShutdownHookManager$.<init>(ShutdownHookManager.scala:58)
	at org.apache.spark.util.ShutdownHookManager$.<clinit>(ShutdownHookManager.scala)
	at org.apache.spark.util.Utils$.createTempDir(Utils.scala:335)
	at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:344)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:898)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
	at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:468)
	at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:439)
	at org.apache.hadoop.util.Shell.<clinit>(Shell.java:516)
	... 22 more


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\tools\miniconda3\envs\kedro-starters\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\tools\miniconda3\envs\kedro-starters\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\circleci\AppData\Local\Temp\tmpwn80ki8s\Scripts\kedro.exe\__main__.py", line 7, in <module>
  File "C:\Users\circleci\AppData\Local\Temp\tmpwn80ki8s\lib\site-packages\kedro\framework\cli\cli.py", line 215, in main
    cli_collection()
  File "C:\Users\circleci\AppData\Local\Temp\tmpwn80ki8s\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\circleci\AppData\Local\Temp\tmpwn80ki8s\lib\site-packages\kedro\framework\cli\cli.py", line 143, in main
    super().main(
  File "C:\Users\circleci\AppData\Local\Temp\tmpwn80ki8s\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "C:\Users\circleci\AppData\Local\Temp\tmpwn80ki8s\lib\site-packages\click\core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\circleci\AppData\Local\Temp\tmpwn80ki8s\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\circleci\AppData\Local\Temp\tmpwn80ki8s\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "C:\Users\circleci\AppData\Local\Temp\tmpwn80ki8s\lib\site-packages\kedro\framework\cli\project.py", line 352, in run
    session.run(
  File "C:\Users\circleci\AppData\Local\Temp\tmpwn80ki8s\lib\site-packages\kedro\framework\session\session.py", line 397, in run
    run_result = runner.run(
  File "C:\Users\circleci\AppData\Local\Temp\tmpwn80ki8s\lib\site-packages\kedro\runner\runner.py", line 87, in run
    self._run(pipeline, catalog, hook_manager, session_id)
  File "C:\Users\circleci\AppData\Local\Temp\tmpwn80ki8s\lib\site-packages\kedro\runner\sequential_runner.py", line 69, in _run
    run_node(node, catalog, hook_manager, self._is_async, session_id)
  File "C:\Users\circleci\AppData\Local\Temp\tmpwn80ki8s\lib\site-packages\kedro\runner\runner.py", line 211, in run_node
    node = _run_node_sequential(node, catalog, hook_manager, session_id)
  File "C:\Users\circleci\AppData\Local\Temp\tmpwn80ki8s\lib\site-packages\kedro\runner\runner.py", line 311, in _run_node_sequential
    catalog.save(name, data)
  File "C:\Users\circleci\AppData\Local\Temp\tmpwn80ki8s\lib\site-packages\kedro\io\data_catalog.py", line 384, in save
    dataset.save(data)
  File "C:\Users\circleci\AppData\Local\Temp\tmpwn80ki8s\lib\site-packages\kedro\io\core.py", line 601, in save
    super().save(data)
  File "C:\Users\circleci\AppData\Local\Temp\tmpwn80ki8s\lib\site-packages\kedro\io\core.py", line 217, in save
    raise DataSetError(message) from exc
kedro.io.core.DataSetError: Failed while saving data to data set SparkDataSet(file_format=parquet, filepath=C:/Users/circleci/AppData/Local/Temp/tmp4z2amcqt/project-dummy/data/02_intermediate/X_train.parquet, load_args={}, save_args={'mode': overwrite}).
An error occurred while calling o80.save.
: java.lang.RuntimeException: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
	at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:736)
	at org.apache.hadoop.util.Shell.getSetPermissionCommand(Shell.java:271)
	at org.apache.hadoop.util.Shell.getSetPermissionCommand(Shell.java:287)
	at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:978)
	at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:660)
	at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:700)
	at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:672)
	at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:699)
	at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:672)
	at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:699)
	at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:672)
	at org.apache.hadoop.fs.ChecksumFileSystem.mkdirs(ChecksumFileSystem.java:788)
	at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.setupJob(FileOutputCommitter.java:356)
	at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupJob(HadoopMapReduceCommitProtocol.scala:178)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:182)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
	at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:91)
	at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:128)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:848)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:567)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:835)
Caused by: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
	at org.apache.hadoop.util.Shell.fileNotFoundException(Shell.java:548)
	at org.apache.hadoop.util.Shell.getHadoopHomeDir(Shell.java:569)
	at org.apache.hadoop.util.Shell.getQualifiedBin(Shell.java:592)
	at org.apache.hadoop.util.Shell.<clinit>(Shell.java:689)
	at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
	at org.apache.hadoop.conf.Configuration.getTimeDurationHelper(Configuration.java:1886)
	at org.apache.hadoop.conf.Configuration.getTimeDuration(Configuration.java:1846)
	at org.apache.hadoop.conf.Configuration.getTimeDuration(Configuration.java:1819)
	at org.apache.hadoop.util.ShutdownHookManager.getShutdownTimeout(ShutdownHookManager.java:183)
	at org.apache.hadoop.util.ShutdownHookManager$HookEntry.<init>(ShutdownHookManager.java:207)
	at org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:304)
	at org.apache.spark.util.SparkShutdownHookManager.install(ShutdownHookManager.scala:181)
	at org.apache.spark.util.ShutdownHookManager$.shutdownHooks$lzycompute(ShutdownHookManager.scala:50)
	at org.apache.spark.util.ShutdownHookManager$.shutdownHooks(ShutdownHookManager.scala:48)
	at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:153)
	at org.apache.spark.util.ShutdownHookManager$.<init>(ShutdownHookManager.scala:58)
	at org.apache.spark.util.ShutdownHookManager$.<clinit>(ShutdownHookManager.scala)
	at org.apache.spark.util.Utils$.createTempDir(Utils.scala:335)
	at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:344)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:898)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
	at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:468)
	at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:439)
	at org.apache.hadoop.util.Shell.<clinit>(Shell.java:516)
	... 22 more

SUCCESS: The process with PID 1180 (child process of PID 2336) has been terminated.
SUCCESS: The process with PID 2336 (child process of PID 6228) has been terminated.
SUCCESS: The process with PID 6228 (child process of PID 6712) has been terminated.
    Then I should get a successful exit code                                 # features/steps/run_steps.py:82
      Assertion Failed: Expected exit code 0 but got 1
      Captured stdout:
      None
      None
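
For context, the HADOOP_HOME and hadoop.home.dir failure above is the well-known winutils requirement when Spark writes to the local filesystem on Windows (see the linked WindowsProblems wiki page); the "Try adding hadoop setup step" commits above experiment with exactly this. A minimal sketch of the commonly suggested workaround, assuming winutils.exe has been downloaded separately and that C:\hadoop is only an illustrative location:

import os

# Point Hadoop at a directory whose bin\ folder contains winutils.exe,
# before any Spark session starts.
os.environ["HADOOP_HOME"] = r"C:\hadoop"
os.environ["PATH"] += os.pathsep + r"C:\hadoop\bin"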

@AhdraMeraliQB AhdraMeraliQB marked this pull request as ready for review May 25, 2022 13:07
@merelcht (Member) left a comment:

Great work @AhdraMeraliQB! I'm happy with skipping the pyspark-iris test on Windows. Ideally we would add in a bit of logic so the builds pass and you don't need an admin to merge: basically a check that says "if it's the pyspark-iris test and we're running on Windows, skip the test". See https://stackoverflow.com/questions/36482419/how-do-i-skip-a-test-in-the-behave-python-bdd-framework, and https://github.com/kedro-org/kedro/blob/main/features/build_docs.feature#L3 + https://github.com/kedro-org/kedro/blob/main/features/environment.py#L41 for how you can add tags to a behave test to skip it.
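
A minimal sketch of the tag-based skip described above, assuming a hypothetical skip-on-windows tag on the pyspark-iris scenarios (the tag name and message are illustrative, not necessarily the exact code merged here):

import sys


def before_scenario(context, scenario):
    # Skip any scenario tagged @skip-on-windows when the suite runs on a
    # Windows worker; behave's Scenario.skip() marks it skipped with a reason.
    if "skip-on-windows" in scenario.effective_tags and sys.platform.startswith("win"):
        scenario.skip("pyspark-iris e2e tests are not yet supported on Windows CI")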

(Resolved review threads on test_requirements.txt, features/steps/sh_run.py, and .circleci/config.yml)
Ahdra Merali added 2 commits May 25, 2022 17:07
.circleci/config.yml:
name: Install Python dependencies
command: |
  pip install git+https://github.com/kedro-org/kedro@main
  cd package && pip install -r requirements.txt -U
Contributor:

Where is this package folder coming from?

@AhdraMeraliQB (Contributor Author):

Nice catch - it looks like we're installing kedro and its requirements, and then installing them again under the kedro-starters requirements; I'll remove the redundant step.

@SajidAlamQB (Contributor) left a comment:

We might want the JDK set up; we could try the Windows JDK setup we have on the Kedro repo. Having it installed may fix the pyspark-iris issues, since pyspark relies on Java.
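
Not the CircleCI setup step referred to above, but a hedged Python sketch of checking the same prerequisite before the e2e run (the error message is illustrative):

import shutil

# pyspark shells out to the JVM, so fail fast if no JDK is on PATH.
if shutil.which("java") is None:
    raise RuntimeError("No 'java' executable found; install a JDK before running pyspark-iris")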

Comment on lines 77 to 80
def after_scenario(context, scenario):
    rmtree(str(context.temp_dir))
    rmtree(str(context.venv_dir))


def rmtree(top):
    # On Windows, files left read-only (e.g. by git) make shutil.rmtree fail,
    # so mark everything user-writable before deleting.
    if os.name != "posix":
        for root, _, files in os.walk(top, topdown=False):
            for name in files:
                os.chmod(os.path.join(root, name), stat.S_IWUSR)
    shutil.rmtree(top)

# Remaining lines of the quoted hunk: the previous per-scenario cleanup,
# which the rmtree helper above replaces.
for path in _PATHS_TO_REMOVE:
    # ignore errors when attempting to remove already removed directories
    shutil.rmtree(path, ignore_errors=True)
Contributor:

Is this just refactoring, or is it necessary to make this work? context and scenario are now redundant arguments. Does this still belong in after_scenario, or should it be in after_all?

@AhdraMeraliQB (Contributor Author):

With how behave is set up, it always passes the context through, but scenario was indeed redundant and has been removed. Thanks for the catch!
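
For reference, a minimal sketch of the hook signatures behave expects in environment.py, which shows why context is always passed and where per-scenario versus end-of-run cleanup would live; the bodies are placeholders, not this PR's code:

def before_scenario(context, scenario):
    ...  # behave always passes the context; scenario hooks also receive the scenario


def after_scenario(context, scenario):
    ...  # per-scenario cleanup, e.g. removing temp dirs


def after_all(context):
    ...  # one-off cleanup at the end of the whole run; only the context is passed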

@noklam (Contributor) left a comment:

Thanks for making this work! I've left a couple of small comments.

I also noticed there is a note on CircleCI; on the kedro main repo we don't have this, and we can see all the test results within the CircleCI UI.

Your output is too large to display in the browser.

Only the last 400000 characters are displayed

@AhdraMeraliQB (Contributor Author):

We might want the JDK set up; we could try the Windows JDK setup we have on the Kedro repo. Having it installed may fix the pyspark-iris issues, since pyspark relies on Java.

@SajidAlamQB that's a good shout - I'll be sure to address it in the pyspark-iris issue 👍

@AhdraMeraliQB (Contributor Author):

I also noticed there is a note on CircleCI; on the kedro main repo we don't have this, and we can see all the test results within the CircleCI UI.

@noklam I think that might be built into CircleCI for larger outputs; I'm not sure anything was changed manually to trigger it 🤔

Ahdra Merali and others added 4 commits June 7, 2022 14:20
Co-authored-by: Nok <nok.lam.chan@quantumblack.com>
@merelcht (Member) left a comment:

Awesome work @AhdraMeraliQB! ⭐ 🎉 Great to finally have Windows builds for our starters.

@AhdraMeraliQB AhdraMeraliQB merged commit bab76ed into main Jun 7, 2022
@AhdraMeraliQB AhdraMeraliQB deleted the feat/add-windows-ci branch June 7, 2022 14:23
Linked issue (may be closed by this PR): [KED-2622] Add Windows CI to kedro-starters