Update starters to support pandas 2.0 #127

Closed
astrojuanlu opened this issue May 5, 2023 · 1 comment · Fixed by #134

Comments

@astrojuanlu
Member

astrojuanlu commented May 5, 2023

Description

Currently the starters can break if a user has pandas 2.0 installed. Update all starters so they can run fine with pandas 2.0 as well as older versions. This means updating the pin for kedro-datasets to ~=1.0 instead of ~=1.0.0.
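
To illustrate why the pin matters, here is a rough sketch of how the `~=` (compatible release) operator behaves for the two pins. The helper names and the candidate version numbers are made up for illustration, and the check is deliberately simplified (plain numeric versions only, no pre-releases), so treat it as a sketch rather than what pip actually runs:

```python
def parse(version: str) -> tuple:
    """Turn a plain numeric version like '1.2.0' into (1, 2, 0)."""
    return tuple(int(part) for part in version.split("."))


def is_compatible(candidate: str, pin: str) -> bool:
    """Simplified `~=pin` check (hypothetical helper, not pip's code).

    The candidate must be at least the pinned version and must share
    every component of the pin except the last one.
    """
    spec = parse(pin)
    cand = parse(candidate)
    # Pad the candidate with zeros so '1.0' compares equal to '1.0.0'.
    cand = cand + (0,) * (len(spec) - len(cand))
    prefix = spec[:-1]
    return cand >= spec and cand[: len(prefix)] == prefix


# Illustrative version numbers only.
allowed_by_loose_pin = is_compatible("1.2.0", "1.0")     # ~=1.0 accepts any 1.x
allowed_by_tight_pin = is_compatible("1.2.0", "1.0.0")   # ~=1.0.0 accepts only 1.0.x
```

With `~=1.0.0` the resolver is limited to 1.0.x patch releases, while `~=1.0` lets it pick up later 1.x minor releases.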

Context

For example in spaceflights:

This should not be a problem if the user follows the normal workflow, but if they install pandas 2 separately, things break:

> pip install kedro pandas scikit-learn openpyxl pyarrow  # problems incoming
> kedro new --starter=spaceflights
> cd spaceflights
> kedro run  # uh oh
[05/05/23 15:25:57] INFO     Kedro project spaceflights                                                                                                                                                 session.py:360
[05/05/23 15:25:59] WARNING  /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/importlib/__init__.py:126: DeprecationWarning: `kedro.extras.datasets` is deprecated and will be removed in     warnings.py:109
                             Kedro 0.19, install `kedro-datasets` instead by running `pip install kedro-datasets`.                                                                                                    
                               return _bootstrap._gcd_import(name[level:], package, level)                                                                                                                            
                                                                                                                                                                                                                      
[05/05/23 15:26:00] INFO     Loading data from 'companies' (CSVDataSet)...                                                                                                                         data_catalog.py:343
                    INFO     Running node: preprocess_companies_node: preprocess_companies([companies]) -> [preprocessed_companies]                                                                        node.py:329
                    INFO     Saving data to 'preprocessed_companies' (ParquetDataSet)...                                                                                                           data_catalog.py:382
                    INFO     Completed 1 out of 6 tasks                                                                                                                                        sequential_runner.py:85
                    INFO     Loading data from 'shuttles' (ExcelDataSet)...                                                                                                                        data_catalog.py:343
[05/05/23 15:26:04] INFO     Running node: preprocess_shuttles_node: preprocess_shuttles([shuttles]) -> [preprocessed_shuttles]                                                                            node.py:329
                    ERROR    Node 'preprocess_shuttles_node: preprocess_shuttles([shuttles]) -> [preprocessed_shuttles]' failed with error:                                                                node.py:354
                             could not convert string to float: '$1325.0'                                                                                                                                             
                    WARNING  There are 5 nodes that have not run.                                                                                                                                        runner.py:205
                             You can resume the pipeline run from the nearest nodes with persisted inputs by adding the following argument to your previous command:                                                  
                               --from-nodes "preprocess_shuttles_node,create_model_input_table_node"                                                                                                                  
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /Users/juan_cano/.micromamba/envs/_test310/bin/kedro:8 in <module>                               │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/framework/cli/cli. │
│ py:211 in main                                                                                   │
│                                                                                                  │
│   208 │   """                                                                                    │
│   209 │   _init_plugins()                                                                        │
│   210 │   cli_collection = KedroCLI(project_path=Path.cwd())                                     │
│ ❱ 211 │   cli_collection()                                                                       │
│   212                                                                                            │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/click/core.py:1130 in    │
│ __call__                                                                                         │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/framework/cli/cli. │
│ py:139 in main                                                                                   │
│                                                                                                  │
│   136 │   │   )                                                                                  │
│   137 │   │                                                                                      │
│   138 │   │   try:                                                                               │
│ ❱ 139 │   │   │   super().main(                                                                  │
│   140 │   │   │   │   args=args,                                                                 │
│   141 │   │   │   │   prog_name=prog_name,                                                       │
│   142 │   │   │   │   complete_var=complete_var,                                                 │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/click/core.py:1055 in    │
│ main                                                                                             │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/click/core.py:1657 in    │
│ invoke                                                                                           │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/click/core.py:1404 in    │
│ invoke                                                                                           │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/click/core.py:760 in     │
│ invoke                                                                                           │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/framework/cli/proj │
│ ect.py:472 in run                                                                                │
│                                                                                                  │
│   469 │   with KedroSession.create(                                                              │
│   470 │   │   env=env, conf_source=conf_source, extra_params=params                              │
│   471 │   ) as session:                                                                          │
│ ❱ 472 │   │   session.run(                                                                       │
│   473 │   │   │   tags=tag,                                                                      │
│   474 │   │   │   runner=runner(is_async=is_async),                                              │
│   475 │   │   │   node_names=node_names,                                                         │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/framework/session/ │
│ session.py:426 in run                                                                            │
│                                                                                                  │
│   423 │   │   )                                                                                  │
│   424 │   │                                                                                      │
│   425 │   │   try:                                                                               │
│ ❱ 426 │   │   │   run_result = runner.run(                                                       │
│   427 │   │   │   │   filtered_pipeline, catalog, hook_manager, session_id                       │
│   428 │   │   │   )                                                                              │
│   429 │   │   │   self._run_called = True                                                        │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/runner/runner.py:9 │
│ 1 in run                                                                                         │
│                                                                                                  │
│    88 │   │   │   self._logger.info(                                                             │
│    89 │   │   │   │   "Asynchronous mode is enabled for loading and saving data"                 │
│    90 │   │   │   )                                                                              │
│ ❱  91 │   │   self._run(pipeline, catalog, hook_manager, session_id)                             │
│    92 │   │                                                                                      │
│    93 │   │   self._logger.info("Pipeline execution completed successfully.")                    │
│    94                                                                                            │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/runner/sequential_ │
│ runner.py:70 in _run                                                                             │
│                                                                                                  │
│   67 │   │                                                                                       │
│   68 │   │   for exec_index, node in enumerate(nodes):                                           │
│   69 │   │   │   try:                                                                            │
│ ❱ 70 │   │   │   │   run_node(node, catalog, hook_manager, self._is_async, session_id)           │
│   71 │   │   │   │   done_nodes.add(node)                                                        │
│   72 │   │   │   except Exception:                                                               │
│   73 │   │   │   │   self._suggest_resume_scenario(pipeline, done_nodes, catalog)                │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/runner/runner.py:3 │
│ 19 in run_node                                                                                   │
│                                                                                                  │
│   316 │   if is_async:                                                                           │
│   317 │   │   node = _run_node_async(node, catalog, hook_manager, session_id)                    │
│   318 │   else:                                                                                  │
│ ❱ 319 │   │   node = _run_node_sequential(node, catalog, hook_manager, session_id)               │
│   320 │                                                                                          │
│   321 │   for name in node.confirms:                                                             │
│   322 │   │   catalog.confirm(name)                                                              │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/runner/runner.py:4 │
│ 15 in _run_node_sequential                                                                       │
│                                                                                                  │
│   412 │   )                                                                                      │
│   413 │   inputs.update(additional_inputs)                                                       │
│   414 │                                                                                          │
│ ❱ 415 │   outputs = _call_node_run(                                                              │
│   416 │   │   node, catalog, inputs, is_async, hook_manager, session_id=session_id               │
│   417 │   )                                                                                      │
│   418                                                                                            │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/runner/runner.py:3 │
│ 81 in _call_node_run                                                                             │
│                                                                                                  │
│   378 │   │   │   is_async=is_async,                                                             │
│   379 │   │   │   session_id=session_id,                                                         │
│   380 │   │   )                                                                                  │
│ ❱ 381 │   │   raise exc                                                                          │
│   382 │   hook_manager.hook.after_node_run(                                                      │
│   383 │   │   node=node,                                                                         │
│   384 │   │   catalog=catalog,                                                                   │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/runner/runner.py:3 │
│ 71 in _call_node_run                                                                             │
│                                                                                                  │
│   368 ) -> Dict[str, Any]:                                                                       │
│   369 │   # pylint: disable=too-many-arguments                                                   │
│   370 │   try:                                                                                   │
│ ❱ 371 │   │   outputs = node.run(inputs)                                                         │
│   372 │   except Exception as exc:                                                               │
│   373 │   │   hook_manager.hook.on_node_error(                                                   │
│   374 │   │   │   error=exc,                                                                     │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/pipeline/node.py:3 │
│ 55 in run                                                                                        │
│                                                                                                  │
│   352 │   │   # purposely catch all exceptions                                                   │
│   353 │   │   except Exception as exc:                                                           │
│   354 │   │   │   self._logger.error("Node '%s' failed with error: \n%s", str(self), str(exc))   │
│ ❱ 355 │   │   │   raise exc                                                                      │
│   356 │                                                                                          │
│   357 │   def _run_with_no_inputs(self, inputs: Dict[str, Any]):                                 │
│   358 │   │   if inputs:                                                                         │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/pipeline/node.py:3 │
│ 44 in run                                                                                        │
│                                                                                                  │
│   341 │   │   │   if not self._inputs:                                                           │
│   342 │   │   │   │   outputs = self._run_with_no_inputs(inputs)                                 │
│   343 │   │   │   elif isinstance(self._inputs, str):                                            │
│ ❱ 344 │   │   │   │   outputs = self._run_with_one_input(inputs, self._inputs)                   │
│   345 │   │   │   elif isinstance(self._inputs, list):                                           │
│   346 │   │   │   │   outputs = self._run_with_list(inputs, self._inputs)                        │
│   347 │   │   │   elif isinstance(self._inputs, dict):                                           │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/pipeline/node.py:3 │
│ 75 in _run_with_one_input                                                                        │
│                                                                                                  │
│   372 │   │   │   │   f"{sorted(inputs.keys())}."                                                │
│   373 │   │   │   )                                                                              │
│   374 │   │                                                                                      │
│ ❱ 375 │   │   return self._func(inputs[node_input])                                              │
│   376 │                                                                                          │
│   377 │   def _run_with_list(self, inputs: Dict[str, Any], node_inputs: List[str]):              │
│   378 │   │   # Node inputs and provided run inputs should completely overlap                    │
│                                                                                                  │
│ /private/tmp/spaceflights/src/spaceflights/pipelines/data_processing/nodes.py:45 in              │
│ preprocess_shuttles                                                                              │
│                                                                                                  │
│   42 │   """                                                                                     │
│   43 │   shuttles["d_check_complete"] = _is_true(shuttles["d_check_complete"])                   │
│   44 │   shuttles["moon_clearance_complete"] = _is_true(shuttles["moon_clearance_complete"])     │
│ ❱ 45 │   shuttles["price"] = _parse_money(shuttles["price"])                                     │
│   46 │   return shuttles                                                                         │
│   47                                                                                             │
│   48                                                                                             │
│                                                                                                  │
│ /private/tmp/spaceflights/src/spaceflights/pipelines/data_processing/nodes.py:16 in _parse_money │
│                                                                                                  │
│   13                                                                                             │
│   14 def _parse_money(x: pd.Series) -> pd.Series:                                                │
│   15 │   x = x.str.replace("$", "", regex=True).str.replace(",", "")                             │
│ ❱ 16 │   x = x.astype(float)                                                                     │
│   17 │   return x                                                                                │
│   18                                                                                             │
│   19                                                                                             │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/pandas/core/generic.py:6 │
│ 324 in astype                                                                                    │
│                                                                                                  │
│    6321 │   │                                                                                    │
│    6322 │   │   else:                                                                            │
│    6323 │   │   │   # else, only a single dtype is given                                         │
│ ❱  6324 │   │   │   new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)           │
│    6325 │   │   │   return self._constructor(new_data).__finalize__(self, method="astype")       │
│    6326 │   │                                                                                    │
│    6327 │   │   # GH 33113: handle empty frame or series                                         │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/pandas/core/internals/ma │
│ nagers.py:451 in astype                                                                          │
│                                                                                                  │
│    448 │   │   elif using_copy_on_write():                                                       │
│    449 │   │   │   copy = False                                                                  │
│    450 │   │                                                                                     │
│ ❱  451 │   │   return self.apply(                                                                │
│    452 │   │   │   "astype",                                                                     │
│    453 │   │   │   dtype=dtype,                                                                  │
│    454 │   │   │   copy=copy,                                                                    │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/pandas/core/internals/ma │
│ nagers.py:352 in apply                                                                           │
│                                                                                                  │
│    349 │   │   │   if callable(f):                                                               │
│    350 │   │   │   │   applied = b.apply(f, **kwargs)                                            │
│    351 │   │   │   else:                                                                         │
│ ❱  352 │   │   │   │   applied = getattr(b, f)(**kwargs)                                         │
│    353 │   │   │   result_blocks = extend_blocks(applied, result_blocks)                         │
│    354 │   │                                                                                     │
│    355 │   │   out = type(self).from_blocks(result_blocks, self.axes)                            │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/pandas/core/internals/bl │
│ ocks.py:511 in astype                                                                            │
│                                                                                                  │
│    508 │   │   """                                                                               │
│    509 │   │   values = self.values                                                              │
│    510 │   │                                                                                     │
│ ❱  511 │   │   new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)           │
│    512 │   │                                                                                     │
│    513 │   │   new_values = maybe_coerce_values(new_values)                                      │
│    514                                                                                           │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/pandas/core/dtypes/astyp │
│ e.py:242 in astype_array_safe                                                                    │
│                                                                                                  │
│   239 │   │   dtype = dtype.numpy_dtype                                                          │
│   240 │                                                                                          │
│   241 │   try:                                                                                   │
│ ❱ 242 │   │   new_values = astype_array(values, dtype, copy=copy)                                │
│   243 │   except (ValueError, TypeError):                                                        │
│   244 │   │   # e.g. _astype_nansafe can fail on object-dtype of strings                         │
│   245 │   │   #  trying to convert to float                                                      │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/pandas/core/dtypes/astyp │
│ e.py:187 in astype_array                                                                         │
│                                                                                                  │
│   184 │   │   values = values.astype(dtype, copy=copy)                                           │
│   185 │                                                                                          │
│   186 │   else:                                                                                  │
│ ❱ 187 │   │   values = _astype_nansafe(values, dtype, copy=copy)                                 │
│   188 │                                                                                          │
│   189 │   # in pandas we don't store numpy str dtypes, so convert to object                      │
│   190 │   if isinstance(dtype, np.dtype) and issubclass(values.dtype.type, str):                 │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/pandas/core/dtypes/astyp │
│ e.py:138 in _astype_nansafe                                                                      │
│                                                                                                  │
│   135 │                                                                                          │
│   136 │   if copy or is_object_dtype(arr.dtype) or is_object_dtype(dtype):                       │
│   137 │   │   # Explicit copy, or required since NumPy can't view from / to object.              │
│ ❱ 138 │   │   return arr.astype(dtype, copy=True)                                                │
│   139 │                                                                                          │
│   140 │   return arr.astype(dtype, copy=copy)                                                    │
│   141                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: could not convert string to float: '$1325.0'

I was about to do a quick demonstration of the spaceflights pipeline, and instead of following the normal process, I installed the dependencies "by hand".

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V): 0.18.8
  • Python version used (python -V): 3.10.10
  • Operating system and version: macOS Ventura
@merelcht changed the title from "spaceflights starter is broken with pandas 2" to "Update starters to support pandas 2.0" on Jun 5, 2023
@deepyaman
Member

deepyaman commented Jun 5, 2023

@astrojuanlu This issue only affects the spaceflights starter; everything else upgrades fine. It may have something to do with an underlying error from numpy, where the block size changes on cast, but I haven't yet figured this out. Will keep you posted if I make progress.

Edit: JK, I think this is because of bad code in Spaceflights: x = x.str.replace("$", "", regex=True). When the pattern is treated as a regex, "$" is the end-of-string anchor rather than a literal dollar sign, so nothing gets stripped. That is why you get ValueError: could not convert string to float: '$1325.0' further down.
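
A minimal sketch of one possible fix, assuming the goal is simply to strip a literal dollar sign and thousands separators. As far as I can tell from the pandas 2.0 release notes, a single-character pattern with regex=True is now interpreted as a real regular expression, where "$" is the end-of-string anchor. Passing regex=False should behave the same on old and new pandas. The sample prices are made up, and this is not necessarily the change that was merged in the starter:

```python
import re

import pandas as pd

# In a regular expression "$" matches the (empty) end of the string,
# so substituting it with "" leaves the text untouched.
unchanged = re.sub("$", "", "$1325.0")


def _parse_money(x: pd.Series) -> pd.Series:
    # regex=False makes "$" and "," plain literal characters.
    x = x.str.replace("$", "", regex=False).str.replace(",", "", regex=False)
    return x.astype(float)


# Made-up sample data in the same shape as the failing value.
prices = pd.Series(["$1325.0", "$4,500.0"])
parsed = _parse_money(prices)
```

Escaping the character instead (a pattern of r"\$" with regex=True) should also work, but the literal form avoids relying on regex semantics at all.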
