dataframe to tibble #55

Closed
Mrostgaard opened this issue Sep 14, 2021 · 7 comments
Labels
raw python repl Issues with piping syntax in raw python REPL

Comments

@Mrostgaard

Hello

I'm trying to read a csv file with pandas and pass it to a tibble to work with it. I couldn't find any documentation for this.

What I want to do is:

  1. Read a CSV file (currently using pandas for this) and convert it to a dataframe
  2. Take that dataframe and do:
df >> group_by(f.col1, f.col2) >> mutate(newCol1 = min(f.col-value), newCol2 = max(f.col-value))

When I try to do it with a pandas dataframe, I get this error:

/python/lib/python3.8/site-packages/pipda/utils.py:161: UserWarning: Failed to fetch the node calling the function, call it with the original function.
  warnings.warn(
NotImplementedError: 'group_by' is not registered for type: <class 'pipda.symbolic.DirectRefAttr'>.

and then just a traceback of the most recent calls.

How should I properly load my CSV file to use datar?

@pwwang
Owner

pwwang commented Sep 14, 2021

Are you running in a raw python REPL?

@Mrostgaard
Author

I'm running in Databricks on azure

@pwwang pwwang added the raw python repl Issues with piping syntax in raw python REPL label Sep 15, 2021
@pwwang
Owner

pwwang commented Sep 15, 2021

Related: #45, #54

datar relies on the source code to detect the AST node, which is how it knows whether a verb is being called with piping syntax.
The AST node can't be detected at runtime in Databricks notebooks or in the raw Python REPL.
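
To illustrate why source availability matters, here is a minimal standard-library sketch. It is not datar's or pipda's actual detection code: `calling_source_line`, `verb`, and `run_snippet` are hypothetical names, and the bracketed filenames are made up to stand in for a notebook cell or REPL input. It only shows that a function can read the text of the line that called it when the interpreter can find that source, and gets nothing back when it cannot.

```python
import linecache
import sys


def calling_source_line():
    """Return the text of the line that called our caller, or '' if unavailable."""
    # Frame 0 is this function, frame 1 is verb(), frame 2 is the code calling verb().
    frame = sys._getframe(2)
    return linecache.getline(frame.f_code.co_filename, frame.f_lineno).strip()


def verb():
    # Hypothetical stand-in for a verb that wants to inspect how it was called.
    return calling_source_line()


SNIPPET = "result = verb()\n"


def run_snippet(filename, register_source):
    """Execute SNIPPET as if it came from `filename` and return what verb() saw."""
    if register_source:
        # Assumption: mimic an environment that keeps each cell's source retrievable.
        linecache.cache[filename] = (
            len(SNIPPET), None, SNIPPET.splitlines(True), filename
        )
    namespace = {"verb": verb}
    exec(compile(SNIPPET, filename, "exec"), namespace)
    return namespace["result"]


# Source not retrievable: the verb cannot see how it was called.
print(repr(run_snippet("<blind-cell>", register_source=False)))
# Source retrievable: the verb can read the calling line and parse it if needed.
print(repr(run_snippet("<visible-cell>", register_source=True)))
```

When the calling line comes back empty, there is nothing to parse into an AST, which is the situation described above for "blind" environments.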

You have two solutions in such a case:

  1. Use regular calling:

    mutate(group_by(df, f.col1, f.col2), newCol1 = min(f.col-value), newCol2 = max(f.col-value))
  2. Use "all piping" mode:

    from pipda import options
    options.assume_all_piping = True
    
    # imports and data loading
    df >> group_by(f.col1, f.col2) >> mutate(newCol1 = min(f.col-value), newCol2 = max(f.col-value))

In this "blind" environment, regular calling and piping calling are mutually exclusive. This means that in "all piping" mode you even have to call df >> nrow() instead of nrow(df), since nrow is registered as a verb. But min and max in the above example are okay, because they are registered as functions.
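
The mutual exclusivity can be sketched with a toy model in plain Python. This is an assumption-laden illustration, not pipda's implementation: `ToyVerb`, `toy_options`, and `register_verb` are hypothetical names, and plain lists stand in for dataframes. The idea is that when the caller's source cannot be inspected, a single global switch has to decide whether every verb call runs immediately or is deferred until `>>` supplies the data.

```python
import functools


class ToyVerb:
    """A deferred call that is still waiting for its data argument (toy model)."""

    def __init__(self, func, args, kwargs):
        self.func, self.args, self.kwargs = func, args, kwargs

    def __rrshift__(self, data):
        # Evaluates `data >> verb(...)` by passing the data as the first argument.
        return self.func(data, *self.args, **self.kwargs)

    def __repr__(self):
        return f"ToyVerb(func={self.func.__name__})"


class ToyOptions:
    # Hypothetical switch named after the option quoted in this thread.
    assume_all_piping = False


toy_options = ToyOptions()


def register_verb(func):
    """Wrap func so it is called directly or deferred, depending on the switch."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if toy_options.assume_all_piping:
            # Blind mode on: treat every call as the right-hand side of `>>`.
            return ToyVerb(func, args, kwargs)
        # Blind mode off: treat every call as a regular call.
        return func(*args, **kwargs)

    return wrapper


@register_verb
def nrow(data):
    return len(data)


@register_verb
def head(data, n):
    return data[:n]


rows = [10, 20, 30]

toy_options.assume_all_piping = False
print(nrow(rows))                  # regular call is evaluated immediately

toy_options.assume_all_piping = True
print(nrow(rows))                  # now only a deferred ToyVerb, not a result
print(rows >> head(2) >> nrow())   # piping evaluates each deferred verb
```

In this toy model, calling a verb the regular way while the switch is on leaves you holding an unevaluated verb object rather than a result, so the two calling styles cannot be mixed.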

@Mrostgaard
Author

It no longer fails on group_by so thanks for that, and an amazing response time!
Option number one doesn't seem to do exactly what I want it to do. After running the mutate function, df is of type:
Verb(func=mutate, dataarg=True)
and it isn't mutated. When I run
print(df[newCol1])
it returns NameError: name 'newCol1' is not defined

Option two just fails with:
ValueError: Length mismatch: Expected axis has 1 elements, new values have 3 elements

This might be a problem with my implementation and not datar though?

Is there anything I need to do to collapse from Verb to dataframe or something?
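
A side note on the NameError quoted above: it is raised by Python itself, before any library code runs, because `newCol1` without quotes is read as a variable name rather than a column label. A minimal sketch, using a plain dict as a hypothetical stand-in for the dataframe (`table` and `lookup_unquoted` are illustration-only names):

```python
# A plain dict keyed by column name stands in for a dataframe here.
table = {"newCol1": [1, 2], "newCol2": [5, 6]}


def lookup_unquoted():
    """Try the unquoted lookup and return the resulting error message."""
    try:
        return table[newCol1]  # newCol1 is an undefined variable, not a string
    except NameError as exc:
        return str(exc)


print(lookup_unquoted())
print(table["newCol1"])  # quoting the column name makes it a string key
```

This only accounts for the NameError; it says nothing about why the result was a Verb object or about the length mismatch reported for option two.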

@pwwang
Owner

pwwang commented Sep 15, 2021

Could you provide a minimal reproducible example, with code and data?

@Mrostgaard
Author

Sure I will try

@Mrostgaard
Author

I have looked at it, and it is definitely the fault of my own merging. It works fine with smaller inputs, so I'm going wrong somewhere; this is not the library's fault.

Thank you for a great library!

@pwwang pwwang mentioned this issue Sep 16, 2021
pwwang added a commit that referenced this issue Sep 16, 2021
* 📝 Add documentation for the "blind" environments (#45, #54, #55)

* 🩹 Fix trimws not importable from datar.all/datar.base

* ✨ Make as_date() return pd datetime types; Add as_pd_date() as an alias of pd.to_datetime() (#56)

* 🔖 0.5.1

* 🚨 Fix linting

* 👷 Deploy the docs on dev branch as well

* 💚 Fix docs deply in CI