
Port the salary example to Python #104

Closed
jhofman opened this issue Nov 2, 2021 · 6 comments · Fixed by #120
Assignees: jhofman
Labels: pipeline-parsing (has to do with parsing data analysis pipelines in a base language, e.g., R), priority next action, python

Comments

@jhofman (Contributor) commented Nov 2, 2021

Let's see if we can get the json for the key frames of this datamation when it's written in Python/Pandas:

"small_salary %>% 
  group_by(Degree) %>%
  summarize(mean = mean(Salary))" %>%
  datamation_sanddance()

In Python, the analysis code should be something more like this:

small_salary.groupby("Degree").mean("Salary")

where small_salary is a pandas DataFrame.
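(Note: .mean("Salary") as written wouldn't actually select the Salary column, since GroupBy.mean() doesn't take a column name as its first argument, at least in current pandas. A minimal, runnable equivalent of the dplyr pipeline might look like this sketch, where the CSV path is a placeholder:)

import pandas as pd

small_salary = pd.read_csv("small_salary.csv")  # placeholder path for the CSV linked below

# group_by(Degree) %>% summarize(mean = mean(Salary)), expressed in pandas
result = small_salary.groupby("Degree", as_index=False).agg(mean=("Salary", "mean"))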

Challenges here:

  1. What's the right line of pandas code to mimic the R code above (using the eval() function)?
  2. How do we programmatically parse Python/pandas code (from a string, for instance)? (see the sketch below)
  3. We need the intermediate data frames at each step.
  4. And then we need to generate datamation-compatible JSON blobs of Vega-Lite specs.

CSV of the data is here
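One rough sketch for challenges 1–3: parse the pipeline string with Python's ast module and re-evaluate each prefix of the method chain to recover the intermediate data frames. The helper name and the toy data below are made up for illustration:

import ast
import pandas as pd

def pipeline_steps(code, env):
    # Yield (source, intermediate result) for each step of a pandas method
    # chain given as a string, e.g. 'small_salary.groupby("Degree").mean()'.
    # Assumes a single chain of method calls on one starting name, and that
    # env maps that name to the dataframe.
    tree = ast.parse(code, mode="eval")

    # walk from the outermost call back down to the root name
    chain = []
    node = tree.body
    while not isinstance(node, ast.Name):
        chain.append(node)
        node = node.func if isinstance(node, ast.Call) else node.value
    chain.append(node)
    chain.reverse()

    for n in chain:
        # skip bare attribute accesses like df.groupby (without the call)
        if isinstance(n, (ast.Name, ast.Call)):
            src = ast.get_source_segment(code, n)
            yield src, eval(src, dict(env))

# toy data standing in for the real CSV
small_salary = pd.DataFrame({"Degree": ["PhD", "PhD", "Masters", "Masters"],
                             "Salary": [95, 90, 80, 85]})

for src, frame in pipeline_steps('small_salary.groupby("Degree").mean()',
                                 {"small_salary": small_salary}):
    print(src)  # small_salary, then the groupby, then the aggregated frame

This sidesteps writing a custom parser, though it re-runs the earlier steps of the pipeline once per prefix.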

cc @chisingh until your account is added to the repo.

@jhofman added the python, pipeline-parsing, and priority next action labels on Nov 2, 2021
@jhofman (Contributor, Author) commented Nov 5, 2021

@chisingh has a first cut working where you pass a string to be eval-ed and the dataframe to operate on. it would be great if we could somehow avoid having to pass the dataframe as well, but it's not a big deal if we can't get around this.

but maybe there's a way to solve both this problem and the next step (parsing the pipeline) at once: could we somehow inherit and extend the DataFrame class to get full access to what's going on with a pandas pipeline and then just tack on a .datamate() function to the class?

if we could it might solve the problem of having to parse the pandas pipeline manually because we'd have access to internal state and could see what happens at each step.

this would help with tricky cases like the following: df.groupby("Work").mean() implicitly takes the mean of the salary column, even though it's not written out. pandas clearly knows this and it would be nice if we didn't have to discover and duplicate these things, but could instead just catch each operation from pandas.
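a very rough sketch of that subclassing idea (the class names and the wrapper are made up, and this isn't how dplython or pandas-ply do it; it just illustrates catching operations and recording steps):

import pandas as pd

class DatamationGroupBy:
    # hypothetical thin wrapper that records aggregations applied after a groupby
    def __init__(self, frame, by):
        self._frame = frame
        self._by = by
        self._groupby = pd.DataFrame.groupby(frame, by)  # plain pandas groupby

    def mean(self, *args, **kwargs):
        result = self._groupby.mean(*args, **kwargs)     # pandas decides which columns to average
        out = DatamationFrame(result)
        out._steps = self._frame._steps + [("groupby", self._by), ("mean", args)]
        return out

class DatamationFrame(pd.DataFrame):
    # hypothetical DataFrame subclass that records each pipeline step
    _metadata = ["_steps"]
    _steps = []

    @property
    def _constructor(self):
        return DatamationFrame

    def groupby(self, by):
        return DatamationGroupBy(self, by)

    def datamate(self):
        # with the recorded steps, every intermediate frame can be regenerated
        return self._steps

df = DatamationFrame({"Degree": ["PhD", "Masters", "PhD"], "Salary": [90, 80, 95]})
print(df.groupby("Degree").mean().datamate())
# [('groupby', 'Degree'), ('mean', ())]

the upside is that pandas itself decides things like which columns get averaged; the recorder only has to note that a groupby and a mean happened.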

it's possible that we can learn something from @dodger487's dplython package or the similar pandas-ply package in terms of pandas hacking.

@jhofman (Contributor, Author) commented Nov 8, 2021

helpful tips from @dodger487, who recommended we check out the following libraries:

https://docs.dask.org/en/stable/10-minutes-to-dask.html
https://www.sympy.org/en/index.html
https://github.com/machow/siuba

@jhofman self-assigned this on Nov 9, 2021
@jhofman (Contributor, Author) commented Nov 16, 2021

@giorgi-ghviniashvili: @chisingh has python code working that generates an array of specs and is now looking into rendering animations within a jupyter notebook.

can you two discuss how to integrate the javascript code into jupyter? it looks like it might be as simple as calling App() with the right jupyter widget/plugin, similar to R's htmlwidgets?

@giorgi-ghviniashvili (Collaborator) commented
There are a bunch of questions on Stack Overflow about embedding JS code in a Jupyter notebook.

https://stackoverflow.com/questions/48248987/inject-execute-js-code-to-ipython-notebook-and-forbid-its-further-execution-on-p

For example: from IPython.display import display, HTML, Javascript, then render with display().

There are also the cell magic commands %%javascript and %%js.

@chisingh can you try to import all the dependencies in the notebook and then call App?

All the <script> tags, seen here, must be included as dependencies, and then App is called.
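A sketch of what that could look like in a notebook cell (the script URLs and the App(...) signature are placeholders, not the actual ones from the screenshots):

import json
from IPython.display import HTML, Javascript, display

# load the JS dependencies and add a container div (URLs are placeholders
# for the <script> tags mentioned above)
display(HTML("""
<script src="https://cdn.jsdelivr.net/npm/vega@5"></script>
<script src="https://cdn.jsdelivr.net/npm/vega-lite@5"></script>
<div id="datamation-app"></div>
"""))

# hand the specs generated in Python to the JS side; the exact App(...) call
# is whatever the existing app exposes, so this is illustrative only
specs = []  # array of vega-lite spec dicts produced by the Python code
display(Javascript(f"App('datamation-app', {{ specs: {json.dumps(specs)} }});"))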

@jhofman (Contributor, Author) commented Nov 18, 2021

@chisingh got a first cut of running @giorgi-ghviniashvili's app inside of jupyter working during our call!

next steps are to:

  1. get python to generate a spec that gets rendered in a notebook instead of using an existing spec
  2. roll the boilerplate html/javascript generation for inline datamations into the datamation object itself (rough sketch below)
  3. maintain unique ids (perhaps by cell) for different datamation instances within a notebook
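a possible shape for steps 2 and 3 (the class name is made up, and the App(...) call is assumed; the idea is just that the object renders itself with a unique container id):

import json
import uuid

class Datamation:
    def __init__(self, specs):
        self.specs = specs  # vega-lite specs generated in Python
        # unique per instance, so multiple datamations can coexist in one notebook
        self.div_id = f"datamation-{uuid.uuid4().hex}"

    def _repr_html_(self):
        # Jupyter renders this automatically when the object is the last
        # expression in a cell, so the html/js boilerplate lives in one place
        return f"""
        <div id="{self.div_id}"></div>
        <script>
          // App(...) stands in for the existing JS entry point
          App("{self.div_id}", {{ specs: {json.dumps(self.specs)} }});
        </script>
        """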

@sharlagelfand (Collaborator) commented
For testing out the notebook (and also testing the Python spec generation), here are the specs for this chunk of code:

"small_salary %>% 
  group_by(Degree) %>%
  summarize(mean = mean(Salary))" %>%
  datamation_sanddance()

and its Python counterpart:

small_salary.groupby("Degree").mean("Salary")

raw spec (the spec that the R code produces - for testing that the python generates specs that are consistent with what R does)

product spec (the spec that the JS code returns after being passed the raw spec - for testing in the notebook)
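A consistency check could be as simple as the sketch below (the file names are placeholders for the two linked specs, and the Python generation step is whatever the port ends up exposing):

import json

def test_python_spec_matches_r_raw_spec():
    # "raw_spec.json" stands in for the raw spec linked above (produced by R);
    # "python_spec.json" stands in for whatever the Python port writes out
    with open("raw_spec.json") as f:
        expected = json.load(f)
    with open("python_spec.json") as f:
        actual = json.load(f)

    assert len(actual) == len(expected)  # same number of key frames
    assert actual == expected            # and identical vega-lite specs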
