
Port the salary example to Python #104

Closed
jhofman opened this issue Nov 2, 2021 · 6 comments · Fixed by #120
Assignees: jhofman
Labels: pipeline-parsing (has to do with parsing data analysis pipelines in a base language, e.g., R), priority next action, python

Comments

@jhofman (Contributor) commented Nov 2, 2021

Let's see if we can get the json for the key frames of this datamation when it's written in Python/Pandas:

"small_salary %>% 
  group_by(Degree) %>%
  summarize(mean = mean(Salary))" %>%
  datamation_sanddance()

In Python, the analysis code should be something more like this:

small_salary.groupby("Degree").mean("Salary")

where small_salary is a pandas DataFrame.
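(Note: .mean("Salary") as written wouldn't actually select the Salary column, since GroupBy.mean() doesn't take a column name as its first argument, at least in current pandas. A minimal, runnable equivalent of the dplyr pipeline might look like this sketch, where the CSV path is a placeholder:)

import pandas as pd

small_salary = pd.read_csv("small_salary.csv")  # placeholder path for the CSV linked below

# group_by(Degree) %>% summarize(mean = mean(Salary)), expressed in pandas
result = small_salary.groupby("Degree", as_index=False).agg(mean=("Salary", "mean"))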

Challenges here:

  1. What's the right line of pandas code to mimic the R code above (using the eval() function)?
  2. How do we programmatically parse Python/pandas code (from a string, for instance)? (see the sketch below)
  3. We need the intermediate data frames at each step.
  4. And then we need to generate datamation-compatible JSON blobs of Vega-Lite specs.

CSV of the data is here
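One rough sketch for challenges 1–3: parse the pipeline string with Python's ast module and re-evaluate each prefix of the method chain to recover the intermediate data frames. The helper name and the toy data below are made up for illustration:

import ast
import pandas as pd

def pipeline_steps(code, env):
    # Yield (source, intermediate result) for each step of a pandas method
    # chain given as a string, e.g. 'small_salary.groupby("Degree").mean()'.
    # Assumes a single chain of method calls on one starting name, and that
    # env maps that name to the dataframe.
    tree = ast.parse(code, mode="eval")

    # walk from the outermost call back down to the root name
    chain = []
    node = tree.body
    while not isinstance(node, ast.Name):
        chain.append(node)
        node = node.func if isinstance(node, ast.Call) else node.value
    chain.append(node)
    chain.reverse()

    for n in chain:
        # skip bare attribute accesses like df.groupby (without the call)
        if isinstance(n, (ast.Name, ast.Call)):
            src = ast.get_source_segment(code, n)
            yield src, eval(src, dict(env))

# toy data standing in for the real CSV
small_salary = pd.DataFrame({"Degree": ["PhD", "PhD", "Masters", "Masters"],
                             "Salary": [95, 90, 80, 85]})

for src, frame in pipeline_steps('small_salary.groupby("Degree").mean()',
                                 {"small_salary": small_salary}):
    print(src)  # small_salary, then the groupby, then the aggregated frame

This sidesteps writing a custom parser, though it re-runs the earlier steps of the pipeline once per prefix.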

cc @chisingh until your account is added to the repo.

@jhofman added the python, pipeline-parsing, and priority next action labels on Nov 2, 2021
@jhofman (Contributor, Author) commented Nov 5, 2021

@chisingh has a first cut working where you pass a string to be eval-ed and the dataframe to operate on. it would be great if we could somehow avoid having to pass the dataframe as well, but it's not a big deal if we can't get around this.

but maybe there's a way to solve both this problem and the next step (parsing the pipeline) at once: could we somehow inherit and extend the DataFrame class to get full access to what's going on with a pandas pipeline and then just tack on a .datamate() function to the class?

if we could it might solve the problem of having to parse the pandas pipeline manually because we'd have access to internal state and could see what happens at each step.

this would help with tricky cases like the following: df.groupby("Work").mean() implicitly takes the mean of the salary column, even though it's not written out. pandas clearly knows this and it would be nice if we didn't have to discover and duplicate these things, but could instead just catch each operation from pandas.
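a very rough sketch of that subclassing idea (the class names and the wrapper are made up, and this isn't how dplython or pandas-ply do it; it just illustrates catching operations and recording steps):

import pandas as pd

class DatamationGroupBy:
    # hypothetical thin wrapper that records aggregations applied after a groupby
    def __init__(self, frame, by):
        self._frame = frame
        self._by = by
        self._groupby = pd.DataFrame.groupby(frame, by)  # plain pandas groupby

    def mean(self, *args, **kwargs):
        result = self._groupby.mean(*args, **kwargs)     # pandas decides which columns to average
        out = DatamationFrame(result)
        out._steps = self._frame._steps + [("groupby", self._by), ("mean", args)]
        return out

class DatamationFrame(pd.DataFrame):
    # hypothetical DataFrame subclass that records each pipeline step
    _metadata = ["_steps"]
    _steps = []

    @property
    def _constructor(self):
        return DatamationFrame

    def groupby(self, by):
        return DatamationGroupBy(self, by)

    def datamate(self):
        # with the recorded steps, every intermediate frame can be regenerated
        return self._steps

df = DatamationFrame({"Degree": ["PhD", "Masters", "PhD"], "Salary": [90, 80, 95]})
print(df.groupby("Degree").mean().datamate())
# [('groupby', 'Degree'), ('mean', ())]

the upside is that pandas itself decides things like which columns get averaged; the recorder only has to note that a groupby and a mean happened.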

it's possible that we can learn something from @dodger487's dplython package or the similar pandas-ply package in terms of pandas hacking.

@jhofman (Contributor, Author) commented Nov 8, 2021

helpful tips from @dodger487, who recommended we check out the following libraries:

https://docs.dask.org/en/stable/10-minutes-to-dask.html
https://www.sympy.org/en/index.html
https://github.com/machow/siuba

@jhofman self-assigned this on Nov 9, 2021
@jhofman (Contributor, Author) commented Nov 16, 2021

@giorgi-ghviniashvili: @chisingh has python code working that generates an array of specs and is now looking into rendering animations within a jupyter notebook.

can you two discuss how to integrate the javascript code into jupyter? it looks like it might be as simple as calling App() with the right jupyter widget/plugin, similar to R's htmlwidgets?

@giorgi-ghviniashvili (Collaborator) commented
There are a bunch of questions on Stack Overflow about embedding JS code in a Jupyter notebook.

https://stackoverflow.com/questions/48248987/inject-execute-js-code-to-ipython-notebook-and-forbid-its-further-execution-on-p

For example: from IPython.display import display, HTML, Javascript, then render with display().

There are also the cell magic commands %%javascript and %%js.

@chisingh can you try to import all the dependencies in the notebook and then call App?

All the <script> tags, seen here, must be included as dependencies, and then App is called.
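A sketch of what that could look like in a notebook cell (the script URLs and the App(...) signature are placeholders, not the actual ones from the screenshots):

import json
from IPython.display import HTML, Javascript, display

# load the JS dependencies and add a container div (URLs are placeholders
# for the <script> tags mentioned above)
display(HTML("""
<script src="https://cdn.jsdelivr.net/npm/vega@5"></script>
<script src="https://cdn.jsdelivr.net/npm/vega-lite@5"></script>
<div id="datamation-app"></div>
"""))

# hand the specs generated in Python to the JS side; the exact App(...) call
# is whatever the existing app exposes, so this is illustrative only
specs = []  # array of vega-lite spec dicts produced by the Python code
display(Javascript(f"App('datamation-app', {{ specs: {json.dumps(specs)} }});"))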

@jhofman (Contributor, Author) commented Nov 18, 2021

@chisingh got a first cut of running @giorgi-ghviniashvili's app inside of jupyter working during our call!

next steps are to:

  1. get python to generate a spec that gets rendered in a notebook instead of using an existing spec
  2. roll the boilerplate html/javascript generation for inline datamations into the datamation object itself (rough sketch below)
  3. maintain unique ids (perhaps by cell) for different datamation instances within a notebook
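a possible shape for steps 2 and 3 (the class name is made up, and the App(...) call is assumed; the idea is just that the object renders itself with a unique container id):

import json
import uuid

class Datamation:
    def __init__(self, specs):
        self.specs = specs  # vega-lite specs generated in Python
        # unique per instance, so multiple datamations can coexist in one notebook
        self.div_id = f"datamation-{uuid.uuid4().hex}"

    def _repr_html_(self):
        # Jupyter renders this automatically when the object is the last
        # expression in a cell, so the html/js boilerplate lives in one place
        return f"""
        <div id="{self.div_id}"></div>
        <script>
          // App(...) stands in for the existing JS entry point
          App("{self.div_id}", {{ specs: {json.dumps(self.specs)} }});
        </script>
        """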

@sharlagelfand (Collaborator) commented
For testing out the notebook (and also testing the Python spec generation), here are the specs for this chunk of code:

"small_salary %>% 
  group_by(Degree) %>%
  summarize(mean = mean(Salary))" %>%
  datamation_sanddance()

and its Python counterpart:

small_salary.groupby("Degree").mean("Salary")

raw spec (the spec that the R code produces - for testing that the python generates specs that are consistent with what R does)

product spec (the spec that the JS code returns after being passed the raw spec - for testing in the notebook)
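A consistency check could be as simple as the sketch below (the file names are placeholders for the two linked specs, and the Python generation step is whatever the port ends up exposing):

import json

def test_python_spec_matches_r_raw_spec():
    # "raw_spec.json" stands in for the raw spec linked above (produced by R);
    # "python_spec.json" stands in for whatever the Python port writes out
    with open("raw_spec.json") as f:
        expected = json.load(f)
    with open("python_spec.json") as f:
        actual = json.load(f)

    assert len(actual) == len(expected)  # same number of key frames
    assert actual == expected            # and identical vega-lite specs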
