<a href="https://colab.research.google.com/github/Resource-Efficiency-Collective/coding-tutorials/blob/main/floweaver_tutorials.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Floweaver tutorials
This notebook goes through additional useful methods for plotting Sankeys in Floweaver and is split into 4 parts:

- Part 1 - Dimension tables
- Part 2 - System boundaries
- Part 3 - Colour intensity scales
- Part 4 - Adding value to flow labels

Run the following two cells to set up the notebook.

Parts of the material for this section were taken from Luke Cullen's repository. Luke is a PhD student at the Resource Efficiency Collective at the University of Cambridge. His original repository is [here](https://github.com/Resource-Efficiency-Collective/coding-tutorials). 

In [None]:
# %%capture
"""Data download and package import"""
# Install floweaver and display widget packages
# %pip install floweaver ipysankeywidget

# Import packages
# import gdown, os
# from google.colab import files
import pandas as pd
import numpy as np
from floweaver import *

# Set the default size of Sankeys to fit the documentation better.
size = dict(width=570, height=300)

In [None]:
"""Display setup"""
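# Enable widget display for Sankeys in Colab (skipped when running elsewhere)
try:
    from google.colab import output
    output.enable_custom_widget_manager()
except ImportError:
    pass  # not in Colab: the custom widget manager is not needed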

## Part 1 - Dimension tables: efficiently adding details of processes and flows

In Floweaver basics we saw how to draw some simple Sankey diagrams and partition them in different ways, such as this:

![](https://github.com/Resource-Efficiency-Collective/coding-tutorials/blob/main/quickstart_example1.png?raw=1)

But to do the grouping on the right-hand side we had to explicitly list which people were "Men" and which were "Women", using a partition like this:

```python
customers_by_gender = Partition.Simple('process', [
    ('Men', ['Fred', 'James']),
    ('Women', ['Susan', 'Mary']),
])
```

We can show this type of information more efficiently -- and with less code -- by using *dimension tables*.

### Definitions

The table we've seen before is a **flow fact table** -- it lists basic information about each flow:

- *source*: where the flow comes from
- *target*: where the flow goes to
- *type* or *material*: what is flowing
- *value*: the size (in tonnes, GJ, £ etc) of the flow

An example of this type of table is shown at the top right of this diagram:

![](https://github.com/Resource-Efficiency-Collective/coding-tutorials/blob/main/dimension_tables.png?raw=1)

The **dimension tables** add extra information about the source/target and type of the flows (the diagram above also shows extra information about the time period the flow relates to, but we're not worrying about time in this tutorial). For example, "farm2" has a *location* attribute set to "Cambridge".
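
As a minimal sketch of the idea, independent of floweaver, the two kinds of table can be modelled with plain Python structures. Apart from farm2's location of "Cambridge", every value below is invented for illustration, and `flows_with_location` is a hypothetical helper name rather than anything floweaver provides.

```python
# Illustrative flow fact table: one row per flow (values are made up).
flows = [
    {'source': 'farm1', 'target': 'Mary', 'type': 'apples', 'value': 5},
    {'source': 'farm2', 'target': 'Fred', 'type': 'bananas', 'value': 3},
]

# Illustrative process dimension table: one entry per process id.
dim_process = {
    'farm1': {'type': 'farm', 'location': 'Ely'},
    'farm2': {'type': 'farm', 'location': 'Cambridge'},
    'Mary': {'type': 'customer', 'location': 'Cambridge'},
    'Fred': {'type': 'customer', 'location': 'Ely'},
}

def flows_with_location(flows, dim_process):
    """Attach the source and target locations to every flow row."""
    return [
        {
            **row,
            'source.location': dim_process[row['source']]['location'],
            'target.location': dim_process[row['target']]['location'],
        }
        for row in flows
    ]

enriched = flows_with_location(flows, dim_process)
print(enriched[1]['source.location'])  # location looked up for farm2
```

The flow rows stay small because each attribute is stored once per process rather than once per flow.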

This tutorial will show how to use dimension tables in floweaver.

In [None]:
# Load the same basic data used in the basics
flows = pd.read_csv('example_data/simple_fruit_sales.csv')
display(flows)

In [None]:
# Load another table giving extra information about the
# farms and customers. `index_col` says the first column
# can be used to look up rows.
processes = pd.read_csv('example_data/simple_fruit_sales_processes.csv', index_col=0)
display(processes)

Each `id` in this table matches a `source` or `target` in the flows table above. We can use this extra information to build the Sankey.

Previously we had only one table, so we could pass it to `weave` directly. Now that we have two, we must combine them into a `Dataset`:

In [None]:
dataset = Dataset(flows, dim_process=processes)

Now we can use the `type` column in the process table to more easily pick out the relevant processes:

In [None]:
nodes = {
    'farms': ProcessGroup('type == "farm"'),
    'customers': ProcessGroup('type == "customer"'),
}

Compare this to how the same thing was written in the basic tutorial:
```python
nodes = {
    'farms': ProcessGroup(['farm1', 'farm2', 'farm3', 
                           'farm4', 'farm5', 'farm6']),
    'customers': ProcessGroup(['James', 'Mary', 'Fred', 'Susan']),
}
```

Because we already know from the process dimension table that James, Mary, Fred and Susan are "customers", we don't have to list them all by name in the ProcessGroup definition -- we can write the *query* `type == "customer"` instead.
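
Conceptually, a query such as `type == "customer"` filters the rows of the process dimension table and keeps the ids that match. The sketch below imitates that step with a plain dictionary. It is not how floweaver evaluates queries internally, and both the table contents and the `select_processes` helper are assumptions made for illustration.

```python
# Hypothetical process dimension table (contents assumed for illustration).
dim_process = {
    'farm1': {'type': 'farm'},
    'farm2': {'type': 'farm'},
    'James': {'type': 'customer'},
    'Mary': {'type': 'customer'},
}

def select_processes(dim_process, column, wanted):
    """Return the ids whose attribute `column` equals `wanted`."""
    return [pid for pid, attrs in dim_process.items()
            if attrs.get(column) == wanted]

customers = select_processes(dim_process, 'type', 'customer')
print(customers)
```

Adding a new customer to the table would change the result without any change to the selection itself, which is the main benefit over listing names by hand.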

The rest of the Sankey diagram definition is the same as before:

In [None]:
ordering = [
    ['farms'],       # put "farms" on the left...
    ['customers'],   # ... and "customers" on the right.
]
bundles = [
    Bundle('farms', 'customers'),
]
sdd = SankeyDefinition(nodes, bundles, ordering)
weave(sdd, dataset).to_widget(**size)

Again, we need to set the partition on the ProcessGroups to see something interesting. Here again, we can use the process dimension table to make this easier:

In [None]:
# Create a Partition which splits based on the `sex` column
# of the dimension table
customers_by_gender = Partition.Simple('process.sex', 
                                       ['Men', 'Women'])

nodes['customers'].partition = customers_by_gender
weave(sdd, dataset).to_widget(**size)

For reference, this is what we wrote before in the basic tutorial:
```python
customers_by_gender = Partition.Simple('process', [
    ('Men', ['Fred', 'James']),
    ('Women', ['Susan', 'Mary']),
])
```

And we can use other columns of the dimension table to set other partitions:

In [None]:
farms_by_organic = Partition.Simple('process.organic', ['yes', 'no'])

nodes['farms'].partition = farms_by_organic
weave(sdd, dataset).to_widget(**size)

Finally, a tip for doing quick exploration of the data with partitions: you can automatically get a Partition which includes all the values that actually occur in your dataset using the `dataset.partition` method:

In [None]:
nodes['farms'].partition = dataset.partition('source.organic')

# This should be the same as before
weave(sdd, dataset).to_widget(**size)

### Summary
The process dimension table adds extra information about each process. You can use this extra information to:

- Pick out the processes you want to include in a ProcessGroup (selection); and
- Split apart groups of processes based on different attributes (partitions).

Things to try:

- Make a diagram showing the locations of farms on the left and the locations of customers on the right

## Part 2 - System boundaries

Often we don't want to show all of the data in one Sankey diagram: we want to focus on one part of the system. But we still want conservation of mass (or whatever is being shown in the diagram) to work, so we end up with flows to & from "elsewhere". These can also be thought of as *imports* and *exports*.

Let's start by recreating the basic example:

In [None]:
# Same partitions as the Quickstart tutorial
farms_with_other = Partition.Simple('process', [
    'farm1',
    'farm2',
    'farm3',
    ('other', ['farm4', 'farm5', 'farm6']),
])

customers_by_name = Partition.Simple('process', [
    'James', 'Mary', 'Fred', 'Susan'
])

# Define the nodes, this time setting the partition from the start
nodes = {
    'farms': ProcessGroup(['farm1', 'farm2', 'farm3', 
                           'farm4', 'farm5', 'farm6'],
                          partition=farms_with_other),
    'customers': ProcessGroup(['James', 'Mary', 'Fred', 'Susan'],
                              partition=customers_by_name),
}

# Ordering and bundles as before
ordering = [
    ['farms'],       # put "farms" on the left...
    ['customers'],   # ... and "customers" on the right.
]

bundles = [
    Bundle('farms', 'customers'),
]

In [None]:
sdd = SankeyDefinition(nodes, bundles, ordering)
weave(sdd, flows).to_widget(**size)

What happens if we remove `farm2` from the ProcessGroup?

In [None]:
nodes['farms'].selection = [
    'farm1', 'farm3', 'farm4', 'farm5', 'farm6'
]
weave(sdd, flows).to_widget(**size)

The flow is still there! But it is labelled with a little arrow to show that it is coming "from elsewhere". This is important because we are still showing Susan and Fred in the diagram, and they get fruit from farm2. If we didn't show those flows, Susan's and Fred's inputs and outputs would not balance.

Try now removing Susan and Fred from the diagram:

In [None]:
nodes['customers'].selection = ['James', 'Mary']
weave(sdd, flows).to_widget(**size)

Now they're gone, we no longer see the incoming flows from `farm2`. But we see some outgoing flows "to elsewhere" from `farm3` and the `other` group. This is because `farm3` is within the system boundary -- it is shown in the diagram -- so its output flow has to go somewhere.
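
The rule behind these automatic flows can be sketched without floweaver: a flow is drawn in full when both ends are inside the system boundary, and as a "from elsewhere" or "to elsewhere" stub when only one end is. The flows below are invented for illustration, and `classify_flow` is a hypothetical helper, not part of the floweaver API.

```python
def classify_flow(flow, inside):
    """Label a flow by where its ends sit relative to the boundary."""
    source_in = flow['source'] in inside
    target_in = flow['target'] in inside
    if source_in and target_in:
        return 'internal'
    if target_in:
        return 'from elsewhere'
    if source_in:
        return 'to elsewhere'
    return 'not shown'

# Boundary after removing farm2, Susan and Fred, as in the cells above.
inside = {'farm1', 'farm3', 'farm4', 'farm5', 'farm6', 'James', 'Mary'}

# Made-up flows; the real ones live in the fruit sales CSV.
flows = [
    {'source': 'farm1', 'target': 'Mary', 'value': 5},
    {'source': 'farm3', 'target': 'Susan', 'value': 2},
    {'source': 'farm2', 'target': 'Fred', 'value': 3},
]

for flow in flows:
    print(flow['source'], '->', flow['target'], ':',
          classify_flow(flow, inside))
```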

### Controlling Elsewhere flows

These flows are added automatically to make sure that mass is conserved, but because they are automatic, we have little control over them. By explicitly adding a flow to or from Elsewhere to the diagram, we can control where they appear and what they look like.

To do this, add a Waypoint for the outgoing flows to 'pass through' on their way across the system boundary:

In [None]:
# Define a new Waypoint
nodes['exports'] = Waypoint(title='exports here')

# Update the ordering to include the waypoint
ordering = [
    ['farms'],                  #     put "farms" on the left...
    ['customers', 'exports'],   # ... and "exports" below "customers"
]                               #     on the right.

# Add a new bundle from "farms" to Elsewhere, via the waypoint
bundles = [
    Bundle('farms', 'customers'),
    Bundle('farms', Elsewhere, waypoints=['exports']),
]

sdd = SankeyDefinition(nodes, bundles, ordering)
weave(sdd, flows).to_widget(**size)

This is pretty similar to what we had already, but now that the waypoint is explicitly listed as part of the `SankeyDefinition`, we have more control over it.

For example, we can put the exports above James and Mary by changing the ordering:

In [None]:
ordering = [
    ['farms'],
    ['exports', 'customers'],
]
sdd = SankeyDefinition(nodes, bundles, ordering)
weave(sdd, flows).to_widget(**size)

Or we can partition the exports Waypoint to show how much of it is apples and bananas:

In [None]:
fruits_by_type = Partition.Simple('type', ['apples', 'bananas'])
nodes['exports'].partition = fruits_by_type
weave(sdd, flows).to_widget(**size)

### Horizontal bands

Often, imports/exports and loss flows are shown in a separate horizontal "band" either above or below the main flows. We can do this by modifying the `ordering` a little bit.

The `ordering` style we have used so far looks like this:

```python
ordering = [
    [list of nodes in layer 1],  # left-hand side
    [list of nodes in layer 2],
    ...
    [list of nodes in layer N],  # right-hand side
]
```

But we can add another layer of nesting to make it look like this:

```python
ordering = [
    # |top band|  |bottom band|
    [ [........], [...........] ],  # left-hand side
    [ [........], [...........] ],
    ...
    [ [........], [...........] ],  # right-hand side
]
```

Here's an example:

In [None]:
ordering = [
    [[],          ['farms'    ]],
    [['exports'], ['customers']],
]
sdd = SankeyDefinition(nodes, bundles, ordering)
weave(sdd, flows).to_widget(**size)
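
The flat ordering used earlier can be read as the special case of the nested form with a single band per layer. A small sketch of that relationship follows, using a hypothetical `to_banded` helper (it is not a floweaver function, and floweaver may handle the conversion differently).

```python
def to_banded(ordering):
    """Wrap each layer of a flat ordering in a single band.

    Layers that are already lists of bands are left unchanged.
    """
    banded = []
    for layer in ordering:
        already_nested = all(isinstance(item, list) for item in layer)
        banded.append(layer if layer and already_nested else [layer])
    return banded

flat = [['farms'], ['customers', 'exports']]
print(to_banded(flat))
```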

### Summary

- All the flows to/from a ProcessGroup are shown, even if the other end of the flow is outside the system boundary (i.e. not part of any ProcessGroup)
- You can control the automatic flows by explicitly adding Bundles to/from `Elsewhere` with a `Waypoint`
- The `ordering` can contain horizontal bands

## Part 3 - Colour-intensity scales

In this tutorial we will look at how to use colours in the Sankey diagram. We have already seen how to use a palette, but in this tutorial we will also create a Sankey where the intensity of the colour is proportional to a numerical value.

In [None]:
"""Import data"""
df1 = pd.read_csv('example_data/holiday_data.csv')
display(df1)

Now take a look at the dataset we are using. This is a very insightful [made-up] dataset about how different types of people lose weight while on holiday enjoying themselves.

In [None]:
df1['value'] = df1['Calories Burnt']
dataset = Dataset(df1)

We now define the partitions of the data. Rather than listing the categories by hand, we use `np.unique` to pick out a list of the unique values that occur in the dataset.

In [None]:
partition_job = Partition.Simple('Employment Job', np.unique(df1['Employment Job']))
partition_activity = Partition.Simple('Activity', np.unique(df1['Activity']))

In fact, this is pretty common so there is a built-in function to do this:

In [None]:
# These statements do the same thing as the ones above
partition_job = dataset.partition('Employment Job')
partition_activity = dataset.partition('Activity')
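
Both approaches boil down to collecting the distinct values of a column. A rough stand-alone sketch of that step is shown below, using made-up rows and a hypothetical `unique_values` helper that returns the values sorted, as `np.unique` does.

```python
def unique_values(rows, column):
    """Return the sorted distinct values found in `column`."""
    return sorted({row[column] for row in rows})

# Made-up rows standing in for the holiday dataset.
rows = [
    {'Employment Job': 'Student', 'Activity': 'Reading'},
    {'Employment Job': 'Trainee', 'Activity': 'Swimming'},
    {'Employment Job': 'Student', 'Activity': 'Swimming'},
]

print(unique_values(rows, 'Employment Job'))
print(unique_values(rows, 'Activity'))
```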

We then go on to define the structure of our Sankey. We define nodes, bundles and the order. In this case it's pretty straightforward:

In [None]:
nodes = {
    'Activity': ProcessGroup(['Activity'], partition_activity),
    'Job': ProcessGroup(['Employment Job'], partition_job),
}

bundles = [
    Bundle('Activity', 'Job'),
]

ordering = [
    ['Activity'],
    ['Job'],
]

Now we will plot a Sankey that shows the calories burnt on each activity by each type of person, since `value` was set to the `Calories Burnt` column above.

In [None]:
# These are the same each time, so just write them here once
size_options = dict(width=500, height=400,
                    margins=dict(left=100, right=100))

sdd = SankeyDefinition(nodes, bundles, ordering)
weave(sdd, dataset).to_widget(**size_options)

We can start using colour by specifying that we want to partition the flows according to type of person. Notice that this time we are using a pre-determined palette. 

You can find all sorts of palettes [listed here](https://jiffyclub.github.io/palettable/colorbrewer/qualitative/).

In [None]:
sdd = SankeyDefinition(nodes, bundles, ordering, flow_partition=partition_job)

weave(sdd, dataset, palette='Set2_8').to_widget(**size_options)

Next, we make the colour of each flow proportional to a numerical value. Pass a `QuantitativeScale` as the `link_color` argument of `weave`, giving it the name of the variable that you want to display in colour. To start off, let's use "value", which is the width of the lines: wider lines will be shown in a darker colour.

In [None]:
weave(sdd, dataset, link_color=QuantitativeScale('value')).to_widget(**size_options)

More information is available in the [floweaver tutorial](https://floweaver.readthedocs.io/en/latest/tutorials/colour-scales.html), but the current re-development of the `measures` input means that the tutorial is slightly outdated.
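
To illustrate what "colour intensity proportional to a value" means, the sketch below linearly maps values onto a white-to-blue ramp. This is only a conceptual stand-in: the target colour, the `intensity_colour` helper, and the choice to scale from zero to the maximum value are all assumptions, and floweaver's own `QuantitativeScale` may normalise and interpolate differently.

```python
def intensity_colour(value, max_value, full=(8, 48, 107)):
    """Blend from white to the `full` RGB colour as value grows."""
    if max_value <= 0:
        share = 0.0
    else:
        # Clamp so out-of-range values still give a valid colour.
        share = min(max(value / max_value, 0.0), 1.0)
    channels = [round(255 + (c - 255) * share) for c in full]
    return '#{:02x}{:02x}{:02x}'.format(*channels)

values = [0, 50, 100]
colours = [intensity_colour(v, max(values)) for v in values]
print(colours)
```

The smallest value maps to white and the largest to the full colour, so darker links correspond to larger values.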

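## Part 4 - Adding value to flow labels

In this part we build the groups of a partition by hand so that each label on the right-hand side also shows a total value.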

In [None]:
import re

def get_Evalues_to_target(flows, process):
    """Return the total value flowing into `process` as a label suffix."""
    value = round(sum(flows.loc[flows.target == process, 'value']), 1)
    return ' (' + str(value) + ' kcal)'

def break_string(x, words=4):
    """Capitalise `x` and insert a line break after the first `words` words."""
    spaces = [i.start() for i in re.finditer(' ', x)]
    if len(spaces) >= words:
        return x[0].upper() + x[1:spaces[words - 1]] + '\n' + x[spaces[words - 1] + 1:]
    else:
        return x[0].upper() + x[1:]
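
As a quick check of `break_string`, the sketch below repeats the function so that it runs on its own and applies it to two made-up labels. The label texts are purely illustrative.

```python
import re

def break_string(x, words=4):
    # Same logic as the cell above, repeated so this example is self-contained.
    spaces = [i.start() for i in re.finditer(' ', x)]
    if len(spaces) >= words:
        return x[0].upper() + x[1:spaces[words - 1]] + '\n' + x[spaces[words - 1] + 1:]
    return x[0].upper() + x[1:]

# A long label is capitalised and split after the second word.
label = break_string('senior software engineer at company', words=2)
print(label)

# A label with fewer spaces than `words` is only capitalised.
short = break_string('student', words=2)
print(short)
```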


In [None]:
df1.head()

In [None]:
# Inspecting sdd

sdd

In [None]:
# The tuple we want to change just needs to have the right structure
tuple(
    Group(break_string(i, words=2) + get_Evalues_to_target(df1, i),
          (('Employment Job', (i,)),))
    for i in df1['Employment Job'].unique()
)

In [None]:
# Updating the size
size_options = dict(width=500, height=400,
                    margins=dict(left=100, right=200))

partition_job = dataset.partition('Employment Job')
partition_activity = dataset.partition('Activity')

nodes = {
    'Activity': ProcessGroup(['Activity'], partition=partition_activity),
    'Job': ProcessGroup(['Employment Job'], partition=Partition(tuple(
        Group(break_string(i, words=5) + get_Evalues_to_target(df1, i),
              (('Employment Job', (i,)),))
        for i in df1['Employment Job'].unique()
    ))),
}

sdd = SankeyDefinition(nodes, bundles, ordering, flow_partition=partition_job)

weave(sdd, dataset, palette='Set2_8').to_widget(**size_options)