In [1]:
from IPython.display import JSON

# Introduction

This document describes how to generate user interfaces from JSON configurations for any Python library.

## Types of transformations

There are two types of transformations:


1. **Library functions:** Methods that are part of a Python library, like pandas
2. **Code snippets:** Pieces of code that are used as a function but are not part of any library

# A first simple transformation

## Defining a simple transformation

The GUI of each transformation is defined as a JSON object that follows the [jsonschema spec](https://json-schema.org/specification.html) and is rendered into a GUI using [react-jsonschema-form](https://github.com/rjsf-team/react-jsonschema-form).

In [2]:
transformations = {
    "read_csv" : {
      "form" : {
        "properties" : {
          "filepath_or_buffer" : {
            "type" : "string",
            "title" : "file path",
            "description" : "Location of the file relative to the notebook. E.g. /Documents/Data/testdata.csv."
          },
          "sep" : {
            "type" : "string",
            "default" : ",",
            "title" : "separator",
            "description" : "Character used to separate columns (e.g. commas, semicolons, etc.)."
          },
          "decimal" : {
            "type" : "string",
            "default" : ".",
            "description" : "Character used to indicate decimals."
          },
          "header" : {
            "type" : "number",
            "description" : "Row to use for the column labels (first row is 0)."
          },
          "new table name" : {
            "type" : "string"
          }
        },
        "required" : [
          "filepath_or_buffer"
        ],
        "title" : "Read CSV file",
        "type" : "object",
        "callerObject" : "pd",
        "function" : "read_csv",
        "transformationType" : "dataLoading",
        "description" : "Load the csv in a table."
      },
      "uischema" : {
        "new table name" : {
          "ui:placeholder" : "Leave blank to modify selected table"
        }
      },
      "library" : {
        "name" : "pandas",
        "importStatement" : "import pandas as pd",
        "namespace" : "pd"
      },
      "keywords" : [
        "csv",
        "load",
        "read"
      ]
    }
}

**Each transformation has four objects:**
- **form:** This follows the structure defined in [react-jsonschema-form](https://react-jsonschema-form.readthedocs.io/en/latest/) for auto-generating forms from a JSON schema. Additionally, we have added a few custom fields for our implementation, which are required for generating code from the form inputs.
- **uischema:** This defines the UI aspects of the form we are generating, according to the [react-jsonschema-form UI reference](https://react-jsonschema-form.readthedocs.io/en/latest/api-reference/uiSchema/).
- **library:** Specifies which package is required to run this function and how to import it.
- **keywords:** Search terms that allow users to find this function in the search interface.
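
As a rough illustration of how the `keywords` list could drive the search interface, the sketch below matches a query against each transformation's keywords and title. The function name `search_transformations` and the matching rule are assumptions made for this example, not the actual eigendata search logic.

```python
def search_transformations(transformations, query):
    """Return the names of transformations whose keywords or title match the query."""
    query = query.strip().lower()
    matches = []
    for name, spec in transformations.items():
        keywords = [keyword.lower() for keyword in spec.get("keywords", [])]
        title = spec.get("form", {}).get("title", "").lower()
        # Hypothetical rule: exact keyword match, or the query appears in the title.
        if query in keywords or query in title:
            matches.append(name)
    return matches


example = {
    "read_csv": {"form": {"title": "Read CSV file"}, "keywords": ["csv", "load", "read"]},
    "sort_values": {"form": {"title": "Sort values "}, "keywords": ["arrange"]},
}
print(search_transformations(example, "load"))
```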

![alt text](img/transformation_search.png)

Let's look at the form object

In [3]:
JSON(transformations['read_csv']['form'])

<IPython.core.display.JSON object>

**We see the following keys, which are part of the JSON Schema standard used by react-jsonschema-form:**
1. **properties**: Contains one element for each form field. Each field maps to a parameter of a Python function.
2. **required**: The parameters defined in properties that must be filled in (if one is left empty, the form returns a validation error).
3. **title**: The title of the function that is displayed in the UI. This should be easy to understand for a non-technical user. We follow a few conventions here:
    1. No underscores
    2. First letter is capitalized
4. **type**: Because we use JSON Schema, we need to define the type of each element in the JSON tree. In this case, form is an object.

Besides these, there are a few keys that are part of the eigendata implementation:
1. **callerObject**: The object that calls this function.
    1. For example, read_csv is called from the pandas module, which we import as pd (a common way to import pandas, which we take as a convention). Thus we set the callerObject to pd.
    2. [Learn more about how it is used in code generation](#understanding-code-generation)
2. **function**: The function name in Python (the same as the outermost key, here `read_csv`).
3. **transformationType**: Some transformations have special properties. In this case, `dataLoading` means the transformation shows up when there is no data loaded. See the reference for other options for transformationType.

Let's see the properties object in more detail

In [4]:
JSON(transformations['read_csv']['form']['properties'])

<IPython.core.display.JSON object>

First we will look at the **new table name** field:

 **All of the transformations have a field called** `new table name`. This is a special field of type string (a simple text input) that determines whether the result of a transformation is saved in a new variable (`df2 = df1.function(params)`) or overwrites the selected one (`df1 = df1.function(params)`).

This field is rendered as a simple text input

![alt text](img/transformation_guide_string_input.png)

We see that it has some placeholder text "Leave blank to modify selected table". This is defined in the UI schema

In [5]:
JSON(transformations['read_csv']['uischema'],expanded=True)

<IPython.core.display.JSON object>

 Each property in the form object can have a UI schema attached to it. By default, the `new table name` field has a UI placeholder that explains how it works.

## Adding more fields 

In [6]:
JSON(transformations['read_csv']['form']['properties'], expanded=True)

<IPython.core.display.JSON object>

<a id='understanding-code-generation'></a>

## Understanding code generation 
When you run the transformation by clicking submit, the form data (a JSON object) is mapped to a Python function call:

**Input: Form data**


```json
{"sep": ",", "decimal": ".", "filepath_or_buffer": "titanic.csv"}
```


**Output: Generated code**

```python
data = pd.read_csv(
    sep=",", 
    decimal=".", 
    filepath_or_buffer="titanic.csv")
```

**Formula structure:** result_object = caller_object.function(parameters)


In this case, the object is a dataframe:
1. **result_object**: This comes from the `new table name` parameter. If empty, it overwrites the object that called it (`df1 = df1.function(params)`).
2. **caller_object**: This is the callerObject defined before in the JSON schema, which determines which object calls the function
3. **function**: This is the function defined above in the JSON schema
4. **parameters**: These are the parameters passed from the form
    1. Each parameter is rendered by combining its key with the value from the user input, joined by the equals sign that Python uses for keyword arguments
    2. For example, <code>sep: ","</code> becomes <code>sep=","</code>
    3. For readability, we put each parameter on its own line
    4. Also notice that Python lets you pass some parameters positionally, meaning you can omit the <code>parametername=</code> and rely on the parameter order defined in the function signature. We believe it is better to always be explicit about which parameter you are setting
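
The mapping described above can be sketched in a few lines. This is a minimal illustration, assuming every value is passed as a quoted string; the helper name `generate_call` is hypothetical and is not the actual eigendata code generator.

```python
def generate_call(form_data, caller_object, function, result_object):
    """Render form data as `result = caller.function(key="value", ...)`."""
    # One keyword argument per line; values are naively wrapped in double quotes
    # (no escaping is attempted in this sketch).
    params = ",\n    ".join(
        f'{key}="{value}"' for key, value in form_data.items()
    )
    return f"{result_object} = {caller_object}.{function}(\n    {params})"


code = generate_call(
    {"sep": ",", "decimal": ".", "filepath_or_buffer": "titanic.csv"},
    caller_object="pd",
    function="read_csv",
    result_object="data",
)
print(code)
```

Printing `code` reproduces the generated call shown above, with one explicit keyword argument per line.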

## Another example
Let's look at another example. Notice below how we are using a callerObject of type DataFrame. This is because sort_values is a method of DataFrame objects, so we define the callerObject as DataFrame.

In [7]:
transformations['sort_values'] = {
      "form" : {
        "properties" : {
          "by" : {
            "type" : "string",
            "$ref" : "#/definitions/column",
            "title" : "column"
          },
          "ascending" : {
            "type" : "string",
            "default" : "True",
            "enum" : [
              "True",
              "False"
            ],
            "codegenstyle" : "variable",
            "description" : "Sort the data ascending (ascending=True) or descending (ascending=False)."
          },
          "new table name" : {
            "type" : "string"
          }
        },
        "definitions" : {
          "column" : {
            "type" : "string",
            "enum" : []
          }
        },
        "required" : [
          "by",
          "ascending"
        ],
        "title" : "Sort values ",
        "type" : "object",
        "callerObject" : "DataFrame",
        "function" : "sort_values",
        "description" : "Sort the dataframe based on one column. "
      },
      "uischema" : {
        "new table name" : {
          "ui:placeholder" : "Leave blank to modify selected table"
        }
      },
      "library" : {
        "name" : "pandas",
        "importStatement" : "import pandas as pd"
      },
      "keywords" : [
        "arrange"
      ]
    }

JSON(transformations['sort_values'],expanded=True)

<IPython.core.display.JSON object>

# Adding UI elements: dropdown

## Adding a simple dropdown

Here we will create a single-select property for one parameter (an element of the properties object):

1. enum: Defines the values in the dropdown.
2. codegenstyle: This is required for the eigendata code generation. It is a custom parameter that lets us customize how the form response is mapped to Python code. As we saw in the example above, the default is to wrap the input in quotation marks, which here would result in something like `ascending="True"`. Setting `codegenstyle` to `"variable"` removes the quotation marks from the user input, producing `ascending=True` instead.
3. default: The default value that is pre-populated in the UI.
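
To make the effect of `codegenstyle` concrete, here is a small sketch of how a single form field might be rendered. The helper `render_parameter` is a hypothetical name used only for illustration; it covers the default behaviour plus the `variable` and `ignore` styles listed in the reference at the end of this document.

```python
def render_parameter(name, value, codegenstyle=None):
    """Render one form field as a keyword argument, or None if it is skipped."""
    if codegenstyle == "ignore":
        return None                  # the field is not a real function parameter
    if codegenstyle == "variable":
        return f"{name}={value}"     # pass the input without quotation marks
    return f'{name}="{value}"'       # default: pass the input as a string


print(render_parameter("ascending", "True"))
print(render_parameter("ascending", "True", "variable"))
```

The first call renders the value as a quoted string, while the second renders it as the Python literal `True`.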

In [8]:
JSON(transformations['sort_values']['form']['properties']['ascending'], expanded=True)

<IPython.core.display.JSON object>

![alt text](img/transformation_guide_select_field.png)

## Adding a dropdown with the names of the columns

We can add dropdown values that depend on runtime information, like the names of the columns of a dataframe. We can tell eigendata to autopopulate a dropdown with the pandas column names by defining the following in the JSON schema:

1. properties
    1. $ref: Here we create a reference to a definition object called column. The idea is that if several parameters use the column list, we only have to define it once (learn more about definitions [here](https://react-jsonschema-form.readthedocs.io/en/latest/usage/definitions/)).
2. definitions: This holds the skeleton used to populate the column dropdown. The empty enum is populated automatically with the names of all columns.
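
The autopopulation step could look roughly like the sketch below. It assumes the column names are already available as a plain list (in a notebook they could come from `list(df.columns)`); the function name `populate_column_enum` is hypothetical.

```python
import copy


def populate_column_enum(form, column_names):
    """Return a copy of the form whose `column` definition lists the given names."""
    filled = copy.deepcopy(form)  # leave the original schema untouched
    definitions = filled.get("definitions", {})
    if "column" in definitions:
        definitions["column"]["enum"] = list(column_names)
    return filled


form = {"definitions": {"column": {"type": "string", "enum": []}}}
filled = populate_column_enum(form, ["Age", "Fare", "Survived"])
print(filled["definitions"]["column"]["enum"])
```

Working on a copy keeps the stored schema generic, so the same definition can be filled again when a different dataframe is selected.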

In [9]:
JSON(transformations['sort_values']['form']['properties']['by'], expanded=True)

<IPython.core.display.JSON object>

In [10]:
JSON(transformations['sort_values']['form']['definitions'], expanded=True)

<IPython.core.display.JSON object>

![alt text](img/transformation_guide_select_columns.png)

If we look at the code generation, we see that although we load the definition called column, the parameter we have rendered is called "by":
```python
data = data.sort_values(
    ascending=False, 
    by="Age")
```

# Adding a multi-select dropdown

The example below describes how to create a GUI for a multi-select using the column-names:



1. Add a reference to definitions/columns (notice the s at the end)
2. Add the columns definition code as-is

In [11]:
transformations["get_dummies"] = {
      "form" : {
        "required" : [
          "columns"
        ],
        "definitions" : {
          "columns" : {
            "type" : "array",
            "uniqueItems" : True,
            "items" : {
              "type" : "string",
              "enum" : []
            }
          }
        },
        "properties" : {
          "columns" : {
            "$ref" : "#/definitions/columns",
            "description" : "Column(s) expanded into several columns for each distinct value with 0/1 indicators."
          },
          "dummy_na" : {
            "title" : "NaN column",
            "description" : "Include a column called NaN that indicates if the value is missing.",
            "default" : "False",
            "type" : "string",
            "codegenstyle" : "variable",
            "enum" : [
              "True",
              "False"
            ]
          },
          "new table name" : {
            "type" : "string"
          }
        },
        "title" : "Create dummies for a column (0/1 indicators)",
        "description" : "Create new columns with 0/1 for each unique value in a column. Returns table with additional columns.",
        "type" : "object",
        "callerObject" : "pd",
        "function" : "get_dummies",
        "selectionAsParameter" : True
      },
      "uischema" : {
        "new table name" : {
          "ui:placeholder" : "Leave blank to modify selected table"
        }
      },
      "library" : {
        "name" : "pandas",
        "importStatement" : "import pandas as pd"
      },
      "keywords" : [
        "1",
        "0",
        "columns"
      ]
    }

In [12]:
JSON(transformations['get_dummies']['form']['properties']['columns'], expanded=True)

<IPython.core.display.JSON object>

In [13]:
JSON(transformations['get_dummies']['form']['definitions'], expanded=True)

<IPython.core.display.JSON object>

![alt text](img/transformation_guide_multiselect_columns.png)

## Understanding the code generation

**Input:** **Form data & selected dataframe**



1. Form data: 
```json
{"columns": ["Parents/Children Aboard", "Siblings/Spouses Aboard"]}
```
2. Selected dataframe: data

<strong>Output: Generated code</strong>


```python
data = data.drop(
    columns=["Parents/Children Aboard","Siblings/Spouses Aboard"])
```



The input from the form is mapped to a generic formula in the following way:

Formula: dataframe = object.function(parameters)
1. **dataframe**: Set to the selected dataframe, given that there is no input in New table name
2. **object** : Here, the callerObject DataFrame is replaced by the selected dataframe that is passed
3. **function**: This is the function defined above in the json schema

# Adding complex fields to the form

This example describes how to create UIs that are rendered as dictionaries in python
```python
property={'input1_field1' : 'input1_field2', 'input2_field1' : 'input2_field2'}
```


This is implemented using an array whose items have two sub-fields. The sub-fields are implemented as nested properties, i.e. a property inside another property:
1. In the example below, you see a property named dtype
2. The type is set to array
3. We define an **items** element of type object
4. We define another properties element inside items (this is where the nesting takes place)
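
To illustrate the dictionary rendering, the sketch below collapses form data for such an array into a dictionary, using the first sub-field as the key and the second as the value. The sub-field names `column` and `type` match the example that follows; the helper `items_to_dict` and the sample values are assumptions for illustration.

```python
def items_to_dict(items, key_field, value_field):
    """Collapse a list of two-field objects into a single dictionary."""
    return {item[key_field]: item[value_field] for item in items}


# Hypothetical form data: one object per row the user added in the form.
form_data = {
    "dtype": [
        {"column": "Age", "type": "int64"},
        {"column": "Fare", "type": "float64"},
    ]
}
dtype = items_to_dict(form_data["dtype"], "column", "type")
print(f"dtype={dtype}")
```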

In [14]:
transformations['astype'] = {
      "form" : {
        "properties" : {
          "dtype" : {
            "type" : "array",
            "items" : {
              "type" : "object",
              "properties" : {
                "column" : {
                  "$ref" : "#/definitions/column"
                },
                "type" : {
                  "type" : "string",
                  "enum" : [
                    "int64",
                    "float64",
                    "string",
                    "bool",
                    "datetime64"
                  ],
                  "enumNames" : [
                    "Integer (e.g. 1, 2, 3)",
                    "Decimal number (e.g. 49.99)",
                    "Text",
                    "Boolean (True/False)",
                    "Date time"
                  ]
                }
              }
            },
            "title" : "Columns to reassign"
          },
          "new table name" : {
            "type" : "string"
          }
        },
        "definitions" : {
          "column" : {
            "type" : "string",
            "enum" : []
          }
        },
        "required" : [
          "dtype"
        ],
        "title" : "Assign column types",
        "type" : "object",
        "callerObject" : "DataFrame",
        "function" : "astype",
        "description" : "Change the data type of columns."
      },
      "uischema" : {
        "new table name" : {
          "ui:placeholder" : "Leave blank to modify selected table"
        },
        "dtype" : {
          "items" : {
            "classNames" : "side-by-side-fields",
            "column" : {
              "classNames" : "left-field"
            },
            "type" : {
              "classNames" : "right-field"
            }
          }
        }
      },
      "keywords" : [
        "convert",
        "data",
        "type",
        "datatype",
        "types",
        "change",
        "modify"
      ],
      "library" : {
        "name" : "pandas",
        "importStatement" : "import pandas as pd"
      }
    }

In [15]:
JSON(transformations['astype']['form']['properties'], expanded=True)

<IPython.core.display.JSON object>

# Transformations that return series

A series transformation is a transformation that returns a [series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html).

Thus, the code we generate is of the type
```python
dataframe['Series name'] = dataframe['Series name'].function(parameters)
```

When we write a series transformation, we need to:

1. Specify an additional column parameter
2. Set the caller object to `DataFrame[Series]`
3. **Set `codegenstyle` to `ignore` on the column parameter, given that the column is not a real parameter of the function that needs to be rendered!**
4. Replace the `new table name` property with a `new column name` property, which allows us either to replace the column or to define a new column
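
The following sketch shows how form data for a series transformation might be turned into code of the shape shown above. The helper `generate_series_call` is a hypothetical name, the field names `column` and `new column name` follow the conventions in this guide, and all values are treated as strings for simplicity.

```python
def generate_series_call(dataframe, form_data, function):
    """Render `df['target'] = df['column'].function(params)` from form data."""
    column = form_data["column"]
    # An empty or missing "new column name" means the source column is overwritten.
    target = form_data.get("new column name") or column
    # These two fields are not real function parameters, so they are skipped
    # (this mirrors `codegenstyle: ignore`).
    skipped = {"column", "new column name"}
    params = ", ".join(
        f'{key}="{value}"' for key, value in form_data.items() if key not in skipped
    )
    return f"{dataframe}['{target}'] = {dataframe}['{column}'].{function}({params})"


code = generate_series_call(
    "data",
    {"column": "Name", "to_replace": "Mr", "value": "Mister"},
    function="replace",
)
print(code)
```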

# Transformations that return variables

A transformation can also return a variable that is not a dataframe/series. We currently call "variable" a few of the core python data types:
- Int/Float
- String
- List

# Transformations with conditional logic

Sometimes, functions can be called in more than one way. This is usually hard for users to understand when they read the documentation. We have implemented this pattern with conditional fields using [react-jsonschema-form dependencies](https://react-jsonschema-form.readthedocs.io/en/latest/usage/dependencies/).

First, we define a property that will control the different ways to render the form. If the parameter that controls the different ways to call a function is not an explicit parameter in the code, make sure to set codegenstyle to ignore

**Example with mode as an actual parameter in the code**

In [16]:
transformations['bin_column'] = {
      "form" : {
        "properties" : {
          "column" : {
            "$ref" : "#/definitions/column",
            "title" : "column",
            "codegenstyle" : "ignore"
          },
          "mode" : {
            "type" : "string",
            "enum" : [
              "size",
              "number",
              "quantiles",
              "custom"
            ],
            "enumNames" : [
              "Size of bin",
              "Number of bins",
              "Quantiles (e.g. quartiles)",
              "Custom interval ranges"
            ],
            "default" : "size"
          },
          "new column name" : {
            "type" : "string"
          }
        },
        "definitions" : {
          "column" : {
            "type" : "string",
            "enum" : []
          }
        },
        "required" : [
          "column",
          "mode"
        ],
        "dependencies" : {
          "mode" : {
            "oneOf" : [
              {
                "properties" : {
                  "mode" : {
                    "enum" : [
                      "size"
                    ]
                  },
                  "start" : {
                    "type" : "number",
                    "description" : "Non-inclusive: Start the bucket above this number. For example, if start is set at 5, bucketing will start at 6."
                  },
                  "size" : {
                    "type" : "number",
                    "description" : "Inclusive: Take buckets of this size. For example, if start is set at 5, and size to 10, it will pick up elements from 6 to 15, including 15."
                  },
                  "end" : {
                    "type" : "number",
                    "description" : "Inclusive: Number at which to end the buckets. E.g. if start = 5, size =10 and end is set to 17, it will only create a category (5,15]. The next valid end would be 25."
                  }
                }
              },
              {
                "properties" : {
                  "mode" : {
                    "enum" : [
                      "number"
                    ]
                  },
                  "bin_number" : {
                    "type" : "number",
                    "title" : "bin number",
                    "description" : "Number of equally spaced bins used. Bin width is then (max-min)/bin number."
                  }
                }
              },
              {
                "properties" : {
                  "mode" : {
                    "enum" : [
                      "quantiles"
                    ]
                  },
                  "quantiles" : {
                    "type" : "number",
                    "description" : "Number of quantiles. 10 for deciles, 4 for quartiles, etc."
                  }
                }
              },
              {
                "properties" : {
                  "mode" : {
                    "enum" : [
                      "custom"
                    ]
                  },
                  "breaks" : {
                    "description" : "Set the break points for the custom ranges. Need to include start and end elements",
                    "type" : "array",
                    "items" : {
                      "type" : "number"
                    }
                  },
                  "closed" : {
                    "type" : "string",
                    "enum" : [
                      "left",
                      "right",
                      "both",
                      "neither"
                    ],
                    "description" : "For example, breaks of [1, 3, 5] closed on the right implies that number 3 will be part of the interval (1-3]. If closed on the left, then 3 would be part of interval [3-5)"
                  }
                }
              }
            ]
          }
        },
        "title" : "[Column] Bin column",
        "callerObject" : "DataFrame[Series].fdt",
        "function" : "bin_column",
        "type" : "object",
        "returnType" : "series",
        "description" : "Creates bins/buckets for columns with numerical values."
      },
      "uischema" : {
        "new column name" : {
          "ui:placeholder" : "Leave blank to modify selected column"
        },
        "mode" : {
          "ui:widget" : "radio"
        },
        "ui:order" : [
          "*",
          "new column name"
        ]
      },
      "library" : {
        "name" : "fastdata.core",
        "importStatement" : "from fastdata.core import *"
      },
      "keywords" : [
        "bin",
        "bucket",
        "group"
      ]
    }

In [17]:
JSON(transformations['bin_column']['form']['properties'], expanded=True)

<IPython.core.display.JSON object>

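As a rough sketch of how the `dependencies` block decides which extra fields to show, the function below picks the `oneOf` branch whose `enum` contains the current value of the controlling property. The name `active_fields` is hypothetical, and the schema is a shortened copy of the `bin_column` form above; react-jsonschema-form performs this resolution itself when rendering.

```python
def active_fields(form, form_data, controller):
    """Return the extra field names enabled by the controlling property's value."""
    selected = form_data.get(controller)
    for branch in form["dependencies"][controller]["oneOf"]:
        properties = branch["properties"]
        if selected in properties[controller]["enum"]:
            # Every property in the branch except the controller is an extra field.
            return [name for name in properties if name != controller]
    return []


form = {
    "dependencies": {
        "mode": {
            "oneOf": [
                {"properties": {"mode": {"enum": ["size"]},
                                "start": {"type": "number"},
                                "size": {"type": "number"},
                                "end": {"type": "number"}}},
                {"properties": {"mode": {"enum": ["number"]},
                                "bin_number": {"type": "number"}}},
            ]
        }
    }
}
print(active_fields(form, {"mode": "number"}, "mode"))
```
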
# Transformations without return type

Some transformations do not return an object and only produce a side effect (for example, `to_csv` writes a file). In this case, you can set returnType to "none".

In [18]:
transformations['to_csv'] = {
      "form" : {
        "properties" : {
          "path_or_buf" : {
            "type" : "string",
            "title" : "file path",
            "description" : "Location of the file. You can also indicate a path to save it in a specific folder."
          },
          "index" : {
            "type" : "string",
            "default" : "False",
            "enum" : [
              "True",
              "False"
            ],
            "enumNames" : [
              "Include index",
              "Exclude index"
            ],
            "codegenstyle" : "variable"
          }
        },
        "required" : [
          "path_or_buf",
          "index"
        ],
        "type" : "object",
        "title" : "Save to CSV",
        "callerObject" : "DataFrame",
        "function" : "to_csv",
        "description" : "Save selected table to csv file.",
        "returnType" : "none"
      },
      "keywords" : [
        "write",
        "save",
        "store"
      ],
      "library" : {
        "name" : "pandas",
        "importStatement" : "import pandas as pd"
      }
    }

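If the return type is not specified, it is inferred from whether the form defines a `new variable name` or `new column name` field; otherwise it defaults to `dataframe` (see the transformation reference).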
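
The reference at the end of this document states that the return type is inferred from the "new variable name" or "new column name" field and is otherwise set to dataframe. A minimal sketch of that fallback follows; the function name `infer_return_type` and the exact field-to-type mapping are assumptions.

```python
def infer_return_type(form):
    """Return the explicit returnType, or infer one from the form's properties."""
    explicit = form.get("returnType")
    if explicit:
        return explicit
    properties = form.get("properties", {})
    # Assumed mapping: the name of the "new ... name" field hints at the result type.
    if "new column name" in properties:
        return "series"
    if "new variable name" in properties:
        return "variable"
    return "dataframe"


print(infer_return_type({"returnType": "none"}))
print(infer_return_type({"properties": {"new column name": {"type": "string"}}}))
print(infer_return_type({"properties": {"new table name": {"type": "string"}}}))
```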

# Code snippets

# Selection as parameter & pure functions

Sometimes, a python transformation is not of the type `object = object.function(parameters)` but rather `object = function(parameters)`.

In these cases, the first positional parameter is often the "subject" of the transformation and the rest of the parameters are configuration. We can treat these cases in the same way from a UI perspective by setting the flag `selectionAsParameter`.

For the `get_dummies` transformation defined earlier, the code will be generated in the following way:
`pd.get_dummies(data, columns='...')`

This way we can give users one single way of interacting and avoid the confusion generated by the fact that there are two different syntax patterns
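
The difference between the two syntax patterns can be sketched as follows. The helper `generate_function_call` is hypothetical; it only illustrates how a `selectionAsParameter` flag could move the selected dataframe from the caller position to the first argument.

```python
def generate_function_call(selected, form_data, caller_object, function,
                           selection_as_parameter=False):
    """Render a call where the selection is either the caller or the first argument."""
    params = [f"{key}={value!r}" for key, value in form_data.items()]
    if selection_as_parameter:
        params.insert(0, selected)  # e.g. pd.get_dummies(data, ...)
        caller = caller_object
    else:
        # A callerObject of "DataFrame" stands for the selected dataframe.
        caller = selected if caller_object == "DataFrame" else caller_object
    return f"{selected} = {caller}.{function}({', '.join(params)})"


print(generate_function_call("data", {"columns": ["Sex"]}, "pd", "get_dummies", True))
print(generate_function_call("data", {"by": "Age"}, "DataFrame", "sort_values"))
```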

In [19]:
JSON(transformations['get_dummies']['form'], expanded=False)

<IPython.core.display.JSON object>

The above is even more interesting when we apply it to pure functions, which have no caller object

# Transformation reference

Examples of react-jsonschema-form [here](https://rjsf-team.github.io/react-jsonschema-form/)

## Form
### JSONschema form
1. **title**: Make the API more user-friendly
    1. No underscores, e.g. `sort_by → sort by`
    2. Use full words instead of abbreviations: `pat → pattern`
    3. Help users understand parameters that take more than one type, e.g. sheet name can be
        1. `sheet name` if using the name
        2. `sheet number` if using an id
    4. Rename the parameter if the API name is very confusing, e.g. `na_sentinel → NaN value`
2. **description**: Add a description for each transformation.
    1. End sentences with a period
3. **required**
    1. Use the property object name and not the title
4. **definitions**
    
### Eigendata custom properties
1. **callerObject**: `DataFrame` is replaced by the selected dataframe, and `Series` by the selection in the column parameter
    1. `pd`
    2. `DataFrame` (for dataframe functions)
    3. `DataFrame.fdt` (for fastdata functions)
    4. `DataFrame[Series]` (for series functions)
    5. `DataFrame[Series].str`
    6. …
2. **function**: The function to be called
3. **transformationType**: If not defined, we assume it is a function
    1. `dataLoading` Will be shown when there is no data
    2. `property` Will be called without parentheses (e.g. `dataframe.shape`)
4. **returnType**: Type of object that will be returned. If not set, it is inferred from the "new variable name" or "new column name" field; otherwise it is set to dataframe.
    1. `none` the result will not be assigned to anything (e.g. `dataframe.to_csv(...)`)
    2. `series` the result will be a series
    3. `dataframe` the result will be a dataframe
    4. `variable` the result is another type of native Python object
5. **selectionAsParameter**: See examples above
  
    
### Form properties
#### JSONschema form
1. **title**:
    1. Only define it if the parameter name is not clear (e.g. the `by` parameter is not easy to understand)
    2. Always lowercase (except new table name)
    3. Use a title if the parameter name has underscores
    4. Keep it short; if it gets too long, use the description field
2. **description**
    1. Always add a description
3. **type**: Use only three types from jsonschema-form
    1. `string`: For single select and string input
    2. `number`: For number inputs (e.g. like round example)
    3. `array`: For multi-select or array input
4. **\$ref**: Used to define single-select columns, multi-select columns and single-select data frames. No need to define type if $ref is used. Options are:
    1. `#/definitions/column`
    2. `#/definitions/columns`
    3. `#/definitions/dataframes`
5. **default**: If there are both enum and enumNames, the default should be the enum value and not the enumNames value

#### Eigendata custom properties
1. **ED: codegenstyle**: If not specified, the input is passed as a string: `property='input'`
    1. `variable` Pass the input without quotation marks: `property=input`
    2. `ignore` Do not render the field as a function parameter (used for the column field of series transformations)
    3. `seriesColumn` Used when the series is not passed as a string but rather as an object with the syntax dataFrame['series']
    4. `seriesColumnList`