---
# Welcome to the interflow Quickstarter!
---
## Introduction
**interflow** is an open-source Python package for collecting, calculating, organizing, and visualizing cross-sectoral resource interdependencies and flows.

This Jupyter Notebook serves as a ready-to-go introduction to the package by walking through two examples:
1. Using sample water and energy data for United States counties for the year 2015 included in the package.
2. Building input data from scratch and running the model with the created data.

For more information on the package, including the user guide and the methodology behind the sample data, visit [the interflow documentation](https://pnnl.github.io/interflow/index.html).

To visit the interflow GitHub repository, [go here](https://github.com/pnnl/interflow).

## Importing the package <a class="anchor" id="chapter1"></a>

Click on the cell below and press Ctrl+Enter to import the interflow package and its modules.

In [None]:
# import the package
import interflow

# Example 1: Using the US Sample Data
This section shows how to load and use the sample input data provided in the package. For a walkthrough on how to build a dataset from scratch, see [Example 2](#Example2).

## 1.1 Loading sample data
The interflow package comes with sample data for all counties in the United States for the year 2015. To load the sample input data, run the cell below.

In [None]:
# read in the sample data
data_input = interflow.read_sample_data()

## 1.2 Running the model
Now that our input data is prepared, we can run some or all of it through the model to start collecting, computing, and organizing our water and energy flows.

#### Selecting a Region
The US sample data covers more than 3,000 US counties. The interflow package can run a single county at a time or the entire dataset of counties. Counties are identified by their Federal Information Processing Standards (FIPS) code rather than by name. The cell below sets the region for analysis to the FIPS code for New York County, NY (36061).

To select other counties from this dataset, any FIPS code can be chosen from the input dataset or retrieved from the list presented here: [County FIPS List](https://pnnl.github.io/interflow/county_list.html)
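
As a minimal sketch of how these codes are structured: a county FIPS code is a five-character string in which the first two digits identify the state and the last three identify the county. The helper names below (`split_fips`, `to_fips`) are hypothetical and not part of interflow. The zero-padding step assumes that codes are stored as strings, as the region value used in this notebook suggests.

```python
def split_fips(fips):
    """Split a 5-digit county FIPS string into (state, county) parts."""
    if len(fips) != 5 or not fips.isdigit():
        raise ValueError(f"expected a 5-digit FIPS string, got {fips!r}")
    return fips[:2], fips[2:]


def to_fips(code):
    """Zero-pad an integer or string code to the 5-character form."""
    return str(code).zfill(5)


# New York County, NY: state part '36', county part '061'
state, county = split_fips("36061")
print(state, county)

# A code that lost its leading zero (for example, after being read as an integer)
print(to_fips(1001))
```

Keeping the code as a string avoids silently dropping a leading zero, which would otherwise produce a region name that does not match the dataset.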

#### Running the model for a single region
Run the cells below to run the model for the selected region.

In [None]:
# set the region equal to the FIPS code for New York County, NY
region = '36061'

In [None]:
# run the model for the selected region
output = interflow.calculate(data=data_input, region_name=region)

## 1.3 Observing the output dataset
The output dataset is a Pandas DataFrame of flow values from the source sector node (columns S1 through S5) to the target sector node (columns T1 through T5) in the indicated units. The number after S or T indicates the level of sector granularity: S1 is the major source sector name, S2 is the subsector/application under that sector for that row, and so on down to S5.

The cell below shows the first five rows of the output. Each row can be read as the flow value from the S (source) node to the T (target) node for the specified county.
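
To make that reading concrete, the sketch below formats one row as a sentence. The row is a hand-written toy example, not actual interflow output: the column names `region`, `units`, and `value`, the subsector names, and the number are assumptions for illustration, and `describe_flow` is a hypothetical helper.

```python
# Toy row laid out like the output described above (all values are made up)
row = {
    "region": "36061",
    "S1": "WSW", "S2": "fresh", "S3": "surfacewater", "S4": "total", "S5": "total",
    "T1": "PWS", "T2": "fresh", "T3": "surfacewater", "T4": "withdrawal", "T5": "total",
    "units": "mgd",
    "value": 12.5,
}


def describe_flow(row):
    """Return a one-line description of a single flow row."""
    # join the five source levels and the five target levels into node names
    source = "-".join(row[f"S{i}"] for i in range(1, 6))
    target = "-".join(row[f"T{i}"] for i in range(1, 6))
    return (f"{row['value']} {row['units']} from {source} "
            f"to {target} in region {row['region']}")


print(describe_flow(row))
```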

In [None]:
output.head()

The US sample data uses acronyms for the major sector names. The table in the appendix of this quickstarter shows the definition of each of these abbreviations.

## 1.4 Visualizing the flows between sectors
The output dataset provides the flow values between nodes for both water and energy. On its own, however, it is not an intuitive way to understand the relationships between nodes and how resources pass from one to the next. The visualization tools integrated into the model can help with this.

### 1.4.1 Sankey Diagrams
Sankey diagrams show flows between nodes and can represent how resources are passed along in a network. The cell below will produce two sankey diagrams with the sample data run output for the indicated region, one for water flows (given in million gallons per day) and one for energy flows (given in billion British thermal units per day).

Only one region can be shown at a time. To see the sankey diagrams for a different county, change the county code above and rerun the .calculate() cell to update the output that is fed into the cell below.

The sankey diagrams can be produced at different levels of granularity, controlled by the 'output_level' parameter of the '.plot_sankey()' function. Below, output_level is set to 1, the coarsest level (major sectors only), as a starting point. Changing this value to any integer from 1 to 5 inclusive splits the flows out to that level of granularity.

For more information on this output, see the [key outputs documentation](https://pnnl.github.io/interflow/user_guide.html#single-unit-sankey-diagrams)
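
One way to picture what the level setting does is as a roll-up: flows whose source and target names match down to the chosen level are summed together. The sketch below illustrates the idea on made-up flows. It is not interflow's implementation, and `aggregate_to_level` is a hypothetical helper.

```python
from collections import defaultdict

# Toy flows: (source levels S1..S5, target levels T1..T5, units, value)
flows = [
    (("WSW", "fresh", "surface", "total", "total"),
     ("AGR", "fresh", "surface", "crop", "total"), "mgd", 4.0),
    (("WSW", "fresh", "ground", "total", "total"),
     ("AGR", "fresh", "ground", "crop", "total"), "mgd", 6.0),
    (("WSW", "fresh", "surface", "total", "total"),
     ("RES", "fresh", "surface", "total", "total"), "mgd", 2.0),
]


def aggregate_to_level(flows, level):
    """Sum flow values that share the same names down to `level` (1 to 5)."""
    if not 1 <= level <= 5:
        raise ValueError("level must be an integer between 1 and 5")
    totals = defaultdict(float)
    for source, target, units, value in flows:
        # keep only the first `level` names of each node
        key = (source[:level], target[:level], units)
        totals[key] += value
    return dict(totals)


# Level 1 merges the two WSW-to-AGR flows; level 3 keeps all three apart
print(aggregate_to_level(flows, 1))
print(len(aggregate_to_level(flows, 3)))
```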

In [None]:
# plot sankey diagrams for water and energy
viz = interflow.plot_sankey(data=output, region_name=region,
                            unit_type1='mgd', unit_type2='bbtu',
                            output_level=1, strip='total')

### 1.4.2 Stacked Sector Bar Charts
In addition to sankey diagrams which show the flows from sector to sector, it's useful to see the flow breakdown within each sector. For more information on this output, see the [key outputs documentation](https://pnnl.github.io/interflow/user_guide.html#single-region-stacked-barcharts-of-sectors)

#### Inflow bar charts
The plot_sector_bar() function shows the inflows to or outflows from a sector, broken down by its subsectors/applications. Setting inflow to True displays the inflow values by subsector for each sector in the specified unit. The units can also be changed: the code below displays energy (bbtu) flows for the given county, and changing the 'unit_type' parameter to 'mgd' for the sample data shows the water flow values instead.

The sectors shown below include the electricity generation sector (EGS) and the residential sector (RES). To adjust the list of sectors included for the chosen county, see the acronym list at the end of this notebook.

In [None]:
# create a list of sectors that you want to see a barchart of
sectors = ['EGS', 'RES']

In [None]:
# plot a stacked barchart of inflows to the specified sectors
interflow.plot_sector_bar(data=output, unit_type='bbtu', region_name=region,
                          sector_list=sectors, inflow=True, strip='total')

#### Outflow bar charts
To observe where outflows from the sector as a whole end up, set the inflow parameter to False, as shown below.

In [None]:
# plot a stacked barchart of outflows from the specified sectors
interflow.plot_sector_bar(data=output, unit_type='bbtu', region_name=region,
                          sector_list=sectors, inflow=False, strip='total')

### 1.4.3 Regional Shaded Maps
Now that we've looked at the values across all sectors within a specific region and the values within specific sectors in a region, we can also look at how values compare across all regions.

The .plot_map() function generates a choropleth map where the included regions are shaded according to the value of the chosen flow. In the above visualization examples we've only run the model for one of the 3,000+ regions available. The .plot_map() function will only display regions you give it. Therefore, running the map with our current output would only show one region shaded. To shade all counties in the US, the full run output for all counties needs to be supplied.

To avoid the computation time required to run the model for all 3,000+ counties here, the full output for all counties has been created and stored in the repository files and can be loaded below.

Note that this function also requires a GeoJSON data file, which is included in the package data files for US counties.

For more information on this visualization, visit the [key output documentation](https://pnnl.github.io/interflow/user_guide.html#choropleth-map-displaying-single-flow-values-across-regions)

#### Load map data

In [None]:
# load full sample data output for all counties
full_output = interflow.load_sample_data_output()

# load GeoJSON file of counties
geo = interflow.load_sample_geojson_data()

#### Generate Choropleth Map
The choropleth map comes with a dropdown menu of flow values from node to node. Selecting a new value will update the map. Additionally, the map can be generated for various levels of data granularity which will update the dropdown menu to reflect this. The map is currently configured to display flow values at level 2 granularity.

In [None]:
# plot flow values in a choropleth map at level 2 granularity
interflow.plot_map(data=full_output, jsonfile=geo, level=2)

----------------
<a id="Example2"></a>
# Example 2: Building and Using Datasets from Scratch
This section will show you how to configure your input data from the ground up to use interflow for your own analysis. This walkthrough gives similar information to the details provided in the [Generalizability Section](https://pnnl.github.io/interflow/user_guide.html#generalizability) of the interflow documentation. For additional details, refer to the documentation.

## 2.1 Creating input data

At its most basic level, interflow collects input values connecting two sectors in specified units (e.g., water delivered to the agriculture sector), calculates additional sector flows in alternative units based on intensity factors (e.g., energy demand based on water delivered to the agriculture sector), and builds connections to and from additional sectors to carry those values (e.g., electricity sector connected to agriculture sector to deliver the energy).

The steps below walk through how to create the necessary input data to conduct the above calculations.

### 2.1.1 Creating known flow value input data

We can start by creating a very simple input DataFrame of a single known flow value. The list of lists (inputs) created below will be used to create the DataFrame, where each inner list is a row of input data. Note that, since this first example is for a known pre-existing flow value, the calculation type is set to 'A_collect'.

The example below creates input data for a region called 'Example1'. The region name must be the first data item, followed by the calculation type. The third through seventh data items are the level 1 through level 5 granularity names for the primary node. Primary node in this context means the **receiving** node/sector. The eighth data item is the unit type associated with the primary node, in this case "gal". Data items nine through fourteen give the same information for the secondary node, which in this context is the **sending** node/sector. The last two data items are the parameter name and its value.

The code below will tell the model to collect and store a known flow value in gal going from water supply to agriculture. More specifically, it will store the amount of water going from the fresh surface water supply from glacial lakes to the fresh surface water that's used in the crop irrigation of rice in agriculture.

A few things to keep in mind:
* The list containing the flow information must match the order and formatting below, and a value must be provided for each data position or an error will be raised.
* Sector and region names should not include the underscore character "_". This will cause the model to treat them as separate items and raise an error.
* The column headers used below must remain as shown.
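
The first two rules above can be checked before handing data to the model. The sketch below is a hypothetical pre-check, not part of interflow: it assumes the 16-item row layout described above and only looks for underscores in the region name and the node level names, since the calculation type and parameter name legitimately contain them.

```python
EXPECTED_LENGTH = 16

# Zero-based positions of the region name and the five level names of each node
NAME_POSITIONS = [0] + list(range(2, 7)) + list(range(8, 13))


def check_row(row):
    """Return a list of problems found in one input row (empty if none)."""
    problems = []
    if len(row) != EXPECTED_LENGTH:
        problems.append(f"expected {EXPECTED_LENGTH} items, got {len(row)}")
        return problems
    for pos in NAME_POSITIONS:
        if "_" in str(row[pos]):
            problems.append(f"item {pos + 1} ({row[pos]!r}) contains an underscore")
    return problems


good_row = ['Example1', 'A_collect',
            'Agriculture', 'Fresh', 'Surface', 'Crop', 'Rice', 'gal',
            'WaterSupply', 'Fresh', 'Surface', 'Lake', 'Glacial', 'gal',
            'flow_value', 1000]

# same row, but with an underscore in the secondary node's level 1 name
bad_row = list(good_row)
bad_row[8] = 'Water_Supply'

print(check_row(good_row))
print(check_row(bad_row))
```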

In [None]:
# import pandas to create the Pandas DataFrame
import pandas as pd

In [None]:
# create an example known flow value from the secondary to the primary node
inputs = [
    ['Example1',                                                              # Region name information
     'A_collect',                                                             # Calculation type
     'Agriculture', 'Fresh', 'Surface', 'Crop', 'Rice', 'gal',                # Primary Node information
     'WaterSupply', 'Fresh', 'Surface', 'Lake', 'Glacial', 'gal',             # Secondary Node information
     'flow_value', 1000]]                                                     # value of known flow               

# convert list to dataframe with the required column names
column_names = ['region', 'type', 't1', 't2', 't3', 't4', 't5', 'T_unit', 
                's1', 's2', 's3', 's4', 's5', 'S_unit', 'parameter', 'value']
data = pd.DataFrame(inputs, columns=column_names)

# show dataframe
data.head()

### 2.1.2 Adding intensity values to do cross-unit calculations

Let's build on the dataset we've created above by adding some intensity values to do cross-unit calculations.

Let's say we want the model to calculate the amount of energy it takes to pump the 1000 gal of water in the agriculture sector, and we know the energy intensity (energy required per unit of water) of pumping the water in this region is 5 MWh per gal. To provide this information and tell the model to do this calculation, we need to add another row to our input data, as shown below.

For these data rows, the primary node information is not an existing target node but a **new** node you're creating. In this case we want to create a new energy node for agriculture irrigation pumping for rice crop. 

The secondary node information in these data rows is the name of the node whose water value the model should use to calculate the energy. In the case below, we are telling the model to calculate the energy associated with the fresh surface water used in the crop irrigation of rice. The intensity parameter indicates that for every gal of fresh surface water used in the rice crop agriculture sector, 5 MWh of energy is required for pumping.

Notice when we recreate the input DataFrame with the added information we now have two rows of data, one for the known flow value and one for the energy intensity value.

In [None]:
# create a new list with intensity calculation information
intensity_data =['Example1',                                                  # Region name information
     'B_calculate',                                                           # Calculation type
     'Agriculture', 'Irrigation', 'Pumping', 'Crop', 'Rice', 'MWh',           # Primary Node information
     'Agriculture', 'Fresh', 'Surface', 'Crop', 'Rice', 'gal',                # Secondary Node information
     'intensity', 5]                                                          # intensity value               

# append the new data list to the existing input data list
inputs.append(intensity_data)

# rebuild the input list to a dataframe with the appropriate column names
data = pd.DataFrame(inputs, columns=column_names)
data.head()

### 2.1.3 Adding source and discharge fraction data
Now that we have our known flow value and we have the intensity value to calculate a cross-sector energy flow, we need to tell the model where it should bring the calculated flow from and where we want to discharge it to (if desired).

Adding on to the example above, we can create three new data lists to accomplish this. The first two data lists created below (source_data1 and source_data2) are telling the model where we want the calculated total energy in pumping for rice crop irrigation to be coming from. In this case, we want 30% to come from the electricity node and 70% from the natural gas node. A new data row will need to be added for each source the energy comes from.

Notice that we don't have to have sub-levels of granularity for nodes and we can fill these in with a placeholder. Here, "total" has been used.
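
If a node has fewer than five named levels, the remaining positions can be filled programmatically. The helper below is a hypothetical convenience function, not part of interflow. It assumes "total" as the placeholder, matching the rows in this example.

```python
def pad_levels(names, placeholder="total", depth=5):
    """Pad a list of node level names out to `depth` entries."""
    names = list(names)
    if not 1 <= len(names) <= depth:
        raise ValueError(f"expected between 1 and {depth} names, got {len(names)}")
    return names + [placeholder] * (depth - len(names))


# only the level 1 name is known, so levels 2 through 5 become 'total'
print(pad_levels(["Electricity"]))

# a fully specified node is returned unchanged
print(pad_levels(["Agriculture", "Irrigation", "Pumping", "Crop", "Rice"]))
```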

The same process is used for telling the model to discharge the energy (i.e., send to a downstream node). We can add as many downstream nodes to discharge to as desired. Here, for simplicity, just one has been given with 100% of the energy used in rice crop irrigation being discharged to EnergyServices.
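
The arithmetic implied by these rows can be sketched in plain Python. This is an illustration of the expected numbers under the values used in this example (1000 gal, 5 MWh per gal), not interflow's internal code. `split_flow` is a hypothetical helper, and its check that fractions sum to 1 is a sanity check chosen for this sketch rather than a documented interflow requirement.

```python
# energy for irrigation pumping: known water flow times intensity
water_gal = 1000
intensity_mwh_per_gal = 5
total_energy_mwh = water_gal * intensity_mwh_per_gal

source_fractions = {"Electricity": 0.30, "NaturalGas": 0.70}
discharge_fractions = {"EnergyServices": 1.0}


def split_flow(total, fractions, tolerance=1e-9):
    """Split a total across nodes according to their fractions."""
    # sanity check for this sketch only: the shares should account for the whole flow
    if abs(sum(fractions.values()) - 1.0) > tolerance:
        raise ValueError("fractions must sum to 1")
    return {node: total * share for node, share in fractions.items()}


sources = split_flow(total_energy_mwh, source_fractions)
discharges = split_flow(total_energy_mwh, discharge_fractions)

print(sources)
print(discharges)
```

Each entry in `sources` corresponds to one 'C_source' row and each entry in `discharges` to one 'D_discharge' row.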

In [None]:
# create a new list with source fraction from electricity
source_data1 =['Example1',                                                    # Region name information
     'C_source',                                                              # Calculation type
     'Agriculture', 'Irrigation', 'Pumping', 'Crop', 'Rice', 'MWh',           # Primary Node information
     'Electricity', 'total', 'total', 'total', 'total', 'MWh',                # Secondary Node information
     'fraction', .30]                                                         # source fraction

# create a new list with source fraction from natural gas
source_data2 =['Example1',                                                    # Region name information
     'C_source',                                                              # Calculation type
     'Agriculture', 'Irrigation', 'Pumping', 'Crop', 'Rice', 'MWh',           # Primary Node information
     'NaturalGas', 'total', 'total', 'total', 'total', 'MWh',                 # Secondary Node information
     'fraction', .70]                                                         # source fraction            

# create a new list with discharge fraction information to energy services
discharge_data =['Example1',                                                  # Region name information
     'D_discharge',                                                           # Calculation type
     'Agriculture', 'Irrigation', 'Pumping', 'Crop', 'Rice', 'MWh',           # Primary Node information
     'EnergyServices', 'total', 'total', 'total', 'total', 'MWh',             # Secondary Node information
     'fraction', 1]                                                           # discharge fraction       

In [None]:
# append the new data source and discharge fraction data to the existing input data list
inputs.append(source_data1)
inputs.append(source_data2)
inputs.append(discharge_data)

# convert input list to dataframe with the appropriate column names
data = pd.DataFrame(inputs, columns=column_names)

# show the full input dataframe
data.head()

## 2.2 Run calculations with created input data file

Now that we have an input data file with some known flow data, an intensity value, and source and discharge fractions, we can run the model to calculate the resulting flows.

In [None]:
# run the model
output = interflow.calculate(data=data)

# display the output dataframe
output.head()

As expected, given the 1000 gallons of water and the 5 MWh/gallon intensity value, the total energy for irrigation pumping for rice crop in agriculture is 5000 MWh. We've told the model that 30% of this energy should come from electricity and 70% from natural gas, which gives us the 1500 MWh and 3500 MWh shown in the output, respectively. We've also told the model that 100% of the energy used in rice crop irrigation pumping should be discharged to the Energy Services node, giving us the 5000 MWh to Energy Services.

## 2.3 Visualizing the output

Running the cell below will show a sankey diagram of the calculated flows between nodes.

In [None]:
interflow.plot_sankey(data=output, unit_type1='gal', unit_type2='MWh',
                      region_name='Example1', output_level=5, strip='total')

## Note:
The example above was built step by step, one type of data row at a time. In practice, all of the rows can be created in one step with the same result, as shown below:

In [None]:
inputs = [
    ['Example1','A_collect','Agriculture', 'Fresh', 'Surface', 'Crop', 'Rice', 'gal',
     'WaterSupply', 'Fresh', 'Surface', 'Lake', 'Glacial', 'gal','flow_value', 1000],
    ['Example1','B_calculate','Agriculture', 'Irrigation', 'Pumping', 'Crop', 'Rice', 'MWh',
     'Agriculture', 'Fresh', 'Surface', 'Crop', 'Rice', 'gal', 'intensity', 5],
    ['Example1','C_source', 'Agriculture', 'Irrigation', 'Pumping', 'Crop', 'Rice', 'MWh',
     'Electricity', 'total', 'total', 'total', 'total', 'MWh','fraction', .30],
    ['Example1','C_source','Agriculture', 'Irrigation', 'Pumping', 'Crop', 'Rice', 'MWh',
     'NaturalGas', 'total', 'total', 'total', 'total', 'MWh', 'fraction', .70],
    ['Example1','D_discharge','Agriculture', 'Irrigation', 'Pumping', 'Crop', 'Rice', 'MWh', 
     'EnergyServices', 'total', 'total', 'total', 'total', 'MWh','fraction', 1]]

# convert input list to dataframe with the appropriate column names
data = pd.DataFrame(inputs, columns=column_names)

# run the model
output = interflow.calculate(data=data)

# display the output dataframe
output.head()

---
## Appendix

### 1. Useful Links

#### [Interflow GitHub Repository](https://github.com/pnnl/interflow)

#### [Interflow Documentation](https://pnnl.github.io/interflow/)

#### [Sample Data Methodology and References](https://pnnl.github.io/interflow/sample_data.html)


### 2. Sample Data Acronym Guide

| Acronym | Description |
| --- | --- |
| AGR | Agriculture Sector |
| CVL | Conveyance Losses |
| COM | Commercial Sector |
| CMP | Consumption/Evaporation |
| EGS | Electricity Generation Supply |
| EPD | Energy Production Demand |
| ESV | Energy Services |
| GWD | Ground Discharge |
| IND | Industrial Sector |
| INX | Discharge to Industrial Sector |
| IRX | Discharge to Irrigation |
| MIN | Mining Sector |
| OCD | Ocean Discharge |
| PRD | Produced Water |
| PWD | Public Water Demand |
| PWI | Public Water Imports |
| PWS | Public Water Supply |
| PWX | Public Water Exports |
| REJ | Rejected Energy |
| RES | Residential Sector |
| SRD | Surface Discharge |
| TRA | Transportation Sector |
| WSW | Water Supply Withdrawals |
| WWD | Wastewater Treatment |
| WWI | Wastewater Imports |
| WWS | Wastewater Supply |
