# Midas Tutorial

Hello! Please follow the tutorial to learn the basics of Midas. Be sure to play around until you are comfortable. You will have about 20 minutes. Should you have any questions, please feel free to ask Yifan, who will be present during the entire session.

## Introduction
Midas is a Jupyter notebook library/extension that aids data exploration by providing relevant static visualizations. The key idea of Midas is that **the operations you perform in the interactive visualization space are also reflected in code space**. You will see what this means when you run the code cells below!

## Dataframe Operations
Midas provides a special dataframe whose syntax follows that of the [data science module](http://data8.org/datascience/) from Data 8. The following are common operations that might be useful for querying:

* SELECT: `df.select(['col_name', 'more_col_name'])`. Note that columns are referenced as strings.
* WHERE: `df.where('col_name', predicate)`. The predicates are functions provided by the [`are`](http://data8.org/datascience/predicates.html) library, such as `are.above(8)` (as opposed to the operator overloading seen in pandas, like `df[df['a']>8]`). If you wish to compare two columns, you can use `.where('col1', predicate, 'col2')`, such as `marbles.where("Price", are.above, "Amount")`.
* GROUP BY: `df.group('col_name', agg_fun)`. The default aggregation for a `group` is count, but you can also supply an aggregation function, such as Python's built-in `sum`, `min`, `max` (or any of the `numpy` aggregation methods that work on arrays). The aggregation is applied to all the columns that are not being grouped on.
* Apply general methods: `df.apply(map_fun, 'col_name')` applies `map_fun` to each value of the named column and returns the results as an array. For instance, to derive a new column called "incremented" that is the original column plus 1, call `df.append_column('incremented', df.apply(lambda x: x + 1, 'col_name'))`.

The following are useful for data modification:
* `append_column(label, values)` appends a new column. Note that `values` must be a numpy-compliant array, for example one created via `make_array` or returned by `apply`.
* `append(array_of_new_values)` appends a new row

Note that you can also access the columns as numpy arrays by using `df['col_name']`, which can be handy to use methods like `np.average(df['col_name'])`.

## Initiate Midas
Import the library and create an instance, `m = Midas()`, which we call the Midas runtime variable. You can only have one Midas instance per notebook.
A dashboard-like area will then pop up on the right. You will see three areas: one shows the data (yellow pane), listing the dataframes with their accompanying columns, and the others hold the charts.

In [None]:
from midas import Midas
m = Midas()

# other utility libraries
import numpy as np
from datascience import Table, make_array
from datascience.predicates import are

## Load data
Midas takes in data through a few APIs. The cell below uses `read_table`, which loads a CSV file from a URL; `from_df` loads from a pandas dataframe instead.

In [None]:
disaster_df = m.read_table('https://vega.github.io/vega-datasets/data/disasters.csv')

## Seeing data

Since a lot of basic visualization is highly predictable, Midas attempts to visualize the basics for you directly. However, you may sometimes want to change the encoding, which is also very easy to do in Midas: just specify `mark`, `x`, and `y`, and if you have three columns, specify the third column for `color` or `size`.

In [None]:
# 🟡 04:04 PM 🟡
disaster_df.append_column('Deaths_bin', disaster_df.apply(lambda x: int(x/200000.0) * 200000.0, 'Deaths'))
Deaths_distribution = disaster_df.group('Deaths_bin')

In [None]:
# 🟡 03:55 PM 🟡
disaster_df.append_column('Year_bin', disaster_df.apply(lambda x: int(x/20.0) * 20.0, 'Year'))
Year_distribution = disaster_df.group('Year_bin')

In [None]:
Entity_distribution = disaster_df.group('Entity')
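The binning cells above map each value to the lower edge of its bin with `int(x / width) * width`, then count the rows per bin with `group`. Here is a minimal sketch of that arithmetic in plain Python. The helper name `bin_value` and the list of years are hypothetical and made up for illustration; they are not part of Midas.

```python
from collections import Counter

def bin_value(x, width):
    """Map x to the lower edge of its bin, mirroring int(x / width) * width."""
    return int(x / width) * width

# Toy years, made up for illustration only.
years = [1900, 1905, 1919, 1920, 1939, 1940, 2017]
year_bins = [bin_value(y, 20.0) for y in years]

# Counting rows per bin mirrors group('Year_bin') with its default aggregation.
year_distribution = Counter(year_bins)
```

Note that `int` truncates toward zero, so this formula only gives the lower bin edge for non-negative values, which is the case for years and death counts.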

## Getting distribution from clicking on the columns pane
Go ahead and click on the columns. After you click, two effects take place:
1. A cell is created that contains the dataframe calls that derive the new filtered values, as well as the visualization calls. These cells carry color emoji such as 🟠, which are indicators to help you navigate visually.
2. A chart that visualizes the derived data is created in the pane on the right-hand side.

If the chart has the wrong encoding, or if the grouping query is inaccurate, feel free to modify the code. You can click on the 📊 icon to copy the current definition to your clipboard. Paste the code into a cell, and the results will be reflected in the chart automatically.

In [None]:
# 🟡 03:37 PM 🟡
disaster_df.append_column('Year_bin', disaster_df.apply(lambda x: int(x/20.0) * 20.0, 'Year'))
Year_distribution = disaster_df.group('Year_bin')

In [None]:
# 🟡 03:37 PM 🟡
disaster_df.append_column('Deaths_bin', disaster_df.apply(lambda x: int(x/200000.0) * 200000.0, 'Deaths'))
Deaths_distribution = disaster_df.group('Deaths_bin')

## Accessing code with "📋"

If you want to take the code with the selection applied, click on the 📋 icon and the code will be copied to your clipboard. Use it however you want!

## Snapshot of the current state with "📷"
Clicking on 📷 will insert a new cell with the current chart you see.

## Making selections
All the existing visualizations are equipped with the ability to **select**.

* With scatter plots, you can **brush** select on both the x and y axis.
* With bar charts, you can either brush to select the x axis items or click.
* With line charts, you can brush to select a range on the x axis.

When you perform a selection, you will observe two effects:
1. The charts will be filtered with the new data.
2. A cell will be generated with the selections you have made. Newly generated cells are appended to the document after the previously executed cell, and if you keep on interacting, the old interactions will be commented out and replaced by the new selection.

In [None]:
# reset selections
m.make_selections([])

## Navigating selections

You will see that your selections are shown in the selection pane (blue). You can rename the selections, and click on one to make that selection again.

## Accessing selections programmatically

Access selection in **predicate** form from the Midas runtime variable, `m` (you can assign it other names if you wish).
- most recent selection: `m.current_selection`
- all selections made in the past: `m.selection_history`

To access selection results in **data** form, you have the following options:
- access a specific chart's filtered data with `<chart_name>.filtered_value`


In [None]:
m.current_selection

In [None]:
m.selection_history
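To make the difference between the **predicate** form and the **data** form concrete, here is a plain-Python sketch. The dictionary used for the selection is a hypothetical stand-in, not the structure Midas actually returns, and the chart rows are made up for illustration.

```python
# Hypothetical predicate form: a column name plus a brushed interval.
# This is an illustrative stand-in, NOT the structure Midas returns.
selection = {'column': 'Year_bin', 'low': 1920.0, 'high': 1960.0}

# Toy chart data, made up for illustration only.
chart_rows = [
    {'Year_bin': 1900.0, 'count': 3},
    {'Year_bin': 1920.0, 'count': 5},
    {'Year_bin': 1940.0, 'count': 2},
    {'Year_bin': 1960.0, 'count': 7},
    {'Year_bin': 1980.0, 'count': 4},
]

def apply_selection(rows, selection):
    """Return the rows whose selected column falls inside [low, high]."""
    column = selection['column']
    return [
        row for row in rows
        if selection['low'] <= row[column] <= selection['high']
    ]

# Data form: the rows that survive the predicate.
filtered = apply_selection(chart_rows, selection)
```

The predicate form describes what was selected, while the data form is the result of applying that description to a chart's data.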

## 🚧 Cleaning Data and Reactive State 🚧 (under development)
Often, the data requires some trimming and modification for analysis to continue. For instance, from the distribution of fires, you notice that only a couple of fire sizes are extreme outliers, and you decide to ignore these points.

However, you might want to keep the previous visualizations and selections. For this, you can use the `update` method to **synchronize state**, so that the charts directly reflect the result of the changes. In cases where the selections are no longer relevant, such as when the relevant column is deleted, the charts will be deleted, but the cells will remain. You can of course create a new dataframe from which to derive charts, in order to preserve the old ones. Note that you cannot update derived dataframes, so in our tutorial, only `disaster_df` can be updated.

## 🚧 Reactive Cells and Custom Visualizations 🚧 (under development)

A reactive cell means that Midas will run it after interactions.
Reactive cells can be used to inspect the state or computation related to the selection events.
The APIs are not yet stable, so they are not exposed here!

In [None]:
%%reactive
# more interesting examples to come!
print(m.current_selection)

## Using Joins for Analysis

When performing analysis, we often want to connect different sources of information. For instance, for this analysis, we might be interested in whether the number of fires is related to average rainfall or temperatures.

Even with joins, Midas can help you "link" the relevant tables together, provided that you tell it how the two tables can be joined, using the API `a_df.can_join(another_df, 'column_name')`, where the two dataframes share the same column name.

In [None]:
# load data from a csv file
stocks_df = m.read_table("https://vega.github.io/vega-datasets/data/sp500.csv")
# basic data cleaning: take the last four characters of 'date' as the year,
# stored as an integer under the label 'Year' to match disaster_df's numeric 'Year' column
stocks_df.append_column('Year', stocks_df.apply(lambda x: int(x[-4:]), 'date'))

In [None]:
# providing Midas with join information.
disaster_df.can_join(stocks_df, 'Year')
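Joining on a shared column means pairing rows from the two tables that hold the same value in that column. Here is a minimal plain-Python sketch of an inner join, to show why both tables need the same column name and comparable values. The helper `inner_join` and the toy rows are hypothetical and made up for illustration; this is not how Midas implements linking.

```python
# Toy tables as lists of dicts; values are made up for illustration only.
disasters = [
    {'Year': 2000, 'Entity': 'Flood', 'Deaths': 10},
    {'Year': 2001, 'Entity': 'Flood', 'Deaths': 20},
    {'Year': 2002, 'Entity': 'Flood', 'Deaths': 30},
]
stocks = [
    {'Year': 2000, 'price': 100.0},
    {'Year': 2001, 'price': 110.0},
]

def inner_join(left, right, key):
    """Pair every left row with every right row that has the same key value."""
    index = {}
    for right_row in right:
        index.setdefault(right_row[key], []).append(right_row)
    joined = []
    for left_row in left:
        for right_row in index.get(left_row[key], []):
            joined.append({**left_row, **right_row})
    return joined

joined = inner_join(disasters, stocks, 'Year')
```

Rows without a match on the key, such as the 2002 row here, drop out of the result. A key stored as a string in one table and as a number in the other would never match, which is why the year is converted to an integer during cleaning.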