# Practice Session 03: Management of networks data

In this session we will study an application of complex networks analysis to cooking. We will start with the *flavors network*, a bi-partite network connecting culinary ingredients to flavour compounds [*].

The initial dataset, prepared by [Ling Cheng in 2016](https://github.com/lingcheng99/Flavor-Network), contains four files:

* `ingredients.csv` -- information about culinary ingredients
* `compounds.csv` -- information about flavour compounds
* `ingredient-compound.csv` -- flavour compounds present in each culinary ingredient
* `recipes.csv` -- ingredients used in recipes around the world (used only for extra points)


[*] Ahn, Y. Y., Ahnert, S. E., Bagrow, J. P., & Barabási, A. L. (2011). [Flavor network and the principles of food pairing](https://doi.org/10.1038/srep00196). Scientific reports, 1(1), 1-7.


<font size="-1" color="gray">(Remove this cell when delivering.)</font>

Author: <font color="blue">Josip Hanak</font>

E-mail: <font color="blue">josip.hanak@fer.hr</font>

Date: <font color="blue">10/5/2022</font>

# 1. The flavors bi-partite graph

## 1.0. Examine your input files

Before you begin, we highly recommend that you:

1. Copy the input files to a local directory on your computer
2. Open them in a spreadsheet and look at them

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

## 1.1. Read the bipartite graph in a dataframe


The following code, which you can leave as-is, reads the ingredient-compound relationship into a dataframe.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [72]:
# Feel free to add imports if you need them

import io
import csv
import pandas as pd
import networkx as nx

from networkx.algorithms import bipartite

import numpy as np
import matplotlib
import scipy

import itertools

from IPython.display import Image

In [73]:
# Leave this code as-is

INPUT_INGR_FILENAME = "ingredients.tsv"
INPUT_COMP_FILENAME = "compounds.tsv"
INPUT_INGR_COMP_FILENAME = "ingredient-compound.tsv"

In [74]:
# Leave this code as-is

ingredients = pd.read_csv(INPUT_INGR_FILENAME, sep="\t")
display(ingredients.head(3))

compounds = pd.read_csv(INPUT_COMP_FILENAME, sep="\t")
display(compounds.head(3))

ingr_comp = pd.read_csv(INPUT_INGR_COMP_FILENAME, sep="\t")
display(ingr_comp.head(3))


Unnamed: 0,# id,ingredient name,category
0,0,magnolia_tripetala,flower
1,1,calyptranthes_parriculata,plant
2,2,chamaecyparis_pisifera_oil,plant derivative


Unnamed: 0,# id,Compound name,CAS number
0,0,jasmone,488-10-8
1,1,5-methylhexanoic_acid,628-46-6
2,2,l-glutamine,56-85-9


Unnamed: 0,# ingredient id,compound id
0,1392,906
1,1259,861
2,1079,673


## 1.2. Create the flavors bipartite network


Create a new dataframe named `flavors`, containing three (and only three) columns named `ingredient` (name of ingredient), `ingredient_category` (name of the category of the ingredient) and `compound` (name of compound). 

The dataframe should be ordered first by `ingredient`, then by `compound`.

*Tips:*

* To [join](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html) a DataFrame A containing columns *x*, *y* and a DataFrame B containing columns *y*, *z*, on column *y*, you can do: `C = A.set_index('y').join(B.set_index('y'))`. 
* To [rename columns](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) in a DataFrame A, for instance column *x* to *u* and column *y* to *v*, you can do: `A = A.rename(columns={"x": "u", "y": "v"})`
* To [drop column](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) x from DataFrame A, you can do: `A = A.drop(columns=['x'])`
* To [sort](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) a DataFrame A by column *x*, then by column *y*, you can do: `A = A.sort_values(['x', 'y'])`
* To [reset the index](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) of a DataFrame A, you can do: `A = A.reset_index(drop=True)`; the index is the column appearing in boldface in front of every row of a DataFrame
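
The tips above can be chained on a small example. The following is only a sketch on toy tables: the dataframes `A` and `B` and their contents are hypothetical and are not taken from the course files.

```python
import pandas as pd

# Toy tables (hypothetical data, not taken from the course files)
A = pd.DataFrame({"x": ["a", "b", "c"], "y": [2, 1, 2]})
B = pd.DataFrame({"y": [1, 2], "z": ["one", "two"]})

# Join on column y: both frames are indexed by y, then joined on that index
C = A.set_index("y").join(B.set_index("y"))

# Turn the join key back into a regular column, then rename, drop and sort
C = C.reset_index()
C = C.rename(columns={"x": "u", "z": "v"})
C = C.drop(columns=["y"])
C = C.sort_values(["u", "v"]).reset_index(drop=True)

print(C)
```

Note that after `join` the key *y* is the index of the result, so it has to be reset (or dropped) before the dataframe contains only regular columns.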

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to create the `flavors` dataframe, representing a bi-partite network of ingredients and compounds</font>

In [75]:
# Attach the compound ids of each ingredient, then look up the compound names
first = ingredients.set_index("# id").join(ingr_comp.set_index("# ingredient id"))
flavors = first.set_index("compound id").join(compounds.set_index("# id"))
# Keep only the three required columns, named as in the specification
flavors = flavors.drop(columns=["CAS number"]).rename(columns={"ingredient name": "ingredient", "category": "ingredient_category"})
flavors = flavors.rename(columns={"Compound name": "compound"}).sort_values(["ingredient", "compound"])
print(flavors.head(3))


                   ingredient ingredient_category        compound
906.0              abies_alba               plant  bornyl_acetate
861.0  abies_alba_pine_needle               plant          maltol
673.0      abies_balsamea_oil    plant derivative         myrcene


Write this dataframe to a `flavors.tsv` file, which should be a tab-separated file containing the three fields `ingredient`, `ingredient_category` and `compound`. Use the function [pandas.DataFrame.to_csv](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html).
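
As an illustration, the sketch below writes a toy dataframe as tab-separated text and reads it back. The dataframe `toy` and its rows are hypothetical, and an in-memory buffer is used instead of a file on disk so that the example has no side effects.

```python
import io
import pandas as pd

# Toy stand-in for the flavors dataframe (hypothetical rows)
toy = pd.DataFrame({
    "ingredient": ["apple", "apple"],
    "ingredient_category": ["fruit", "fruit"],
    "compound": ["compound_a", "compound_b"],
})

# sep="\t" produces tab-separated output; index=False leaves out the row index
buffer = io.StringIO()
toy.to_csv(buffer, sep="\t", index=False)
text = buffer.getvalue()

# The first line is the header with the three field names
header = text.splitlines()[0]

# Reading the text back with the same separator recovers the three columns
reloaded = pd.read_csv(io.StringIO(text), sep="\t")
print(reloaded)
```

Passing a filename such as `"flavors.tsv"` instead of the buffer writes the same content to disk.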

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to save *flavors* into a tab-separated file.</font>

In [16]:
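# Write a tab-separated file with only the three required fields (no index column)
flavors.to_csv("flavors.tsv", sep="\t", index=False)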

## 1.3. Open this bi-partite network in Cytoscape


### 1.3.1. Examine the file you generated

Open the ``flavors.tsv`` file in a spreadsheet program to make sure you generated it correctly; it should have exactly 3 tab-separated columns.

### 1.3.2. Import this file in Cytoscape

Remember these files are imported with ``File > Import > Network from File ...``. Then, you have to select:

* ingredient as a ``Source Node``
* ingredient_category as a ``Source Node Attribute``.
* compound as a ``Target Node`` 

### 1.3.3. Draw a small part of this graph

Find the `watermelon` node and everything connected to it at distance 1 or 2. To do this, find "watermelon" and then click on the "two-houses" (neighbors) icon twice. Extract the selected nodes as a sub-graph by doing `File > New network > From selected nodes, all edges`.
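
The same "distance 1 or 2" selection can be reproduced in NetworkX with `nx.ego_graph`, which may help to check what Cytoscape should select. This is only a sketch on a toy graph: the nodes `compound_a`, `compound_b`, `ingredient_x` and `ingredient_y` are hypothetical and do not describe real links in the dataset.

```python
import networkx as nx

# Toy bipartite chain (hypothetical links):
# watermelon - compound_a - ingredient_x - compound_b - ingredient_y
B = nx.Graph()
B.add_edges_from([
    ("watermelon", "compound_a"),
    ("compound_a", "ingredient_x"),
    ("ingredient_x", "compound_b"),
    ("compound_b", "ingredient_y"),
])

# Subgraph induced by all nodes at distance 2 or less from watermelon
sub = nx.ego_graph(B, "watermelon", radius=2)

print(sorted(sub.nodes()))
```

In a bipartite ingredient-compound graph, distance 1 reaches the compounds of the ingredient and distance 2 reaches the other ingredients that share at least one of those compounds.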

Run the network analyzer and then perform `Layout > Edge weighted spring embedded` using edge betweenness.

Style the network so that ingredient nodes have a color that depends on their category, using any color except black, and setting black to be the default node color so that compound nodes remain in color black. Set the label color to white.

Save the image as `flavors.png`; the next cell should display it.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [76]:
# Adjust width/height if necessary

Image(url="flavors.png", width=1200)

<font size="+1" color="red">Replace this cell by a brief commentary, in your own words, of what you see in this graph. Include your answer to at least these two questions: (1) What ingredient groupings do you observe? (2) What connects different groupings?</font>

In [None]:
(1) I see three large, clearly distinct ingredient groupings and many small ones. Two of the large groups consist of meat-based ingredients and the third consists of plant-based ingredients.
(2) The groupings are connected to the other parts of the graph through compounds that are present in ingredients of more than one grouping; these compounds are visible as black nodes.

# 2. The ingredient-ingredient graph

The bi-partite flavors graph is hard to visualize as it mixes ingredients and compounds. We will now try to visualize only the connections between ingredients.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>


## 2.1. Create an ingredient-ingredient.gml file


First, copy the list of ingredients into an array `ingredients_array`. To convert column *x* of DataFrame *A* to an array, use `np.asarray(A['x'])`.

Then, create a dictionary named `ingredient_to_compounds`, in which keys are ingredients, and values are sets of compounds. To create an empty set, you can use `s = set()`. To add to a set, you can do `s.add(element)`. Your code should look like this:

```python
ingredients_array = ...
print("There are %d ingredients" % (len(ingredients_array)))

ingredient_to_compounds = {}

for index, row in flavors.iterrows():
    ...

```
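
A minimal sketch of this pattern on a toy dataframe follows. The dataframe `toy_flavors` and its rows are hypothetical; the check with `pd.notna` is an assumption about how one may want to treat missing compounds, so that `NaN` does not end up as an element of a set.

```python
import pandas as pd

# Toy stand-in for flavors (hypothetical rows); plum has no known compound
toy_flavors = pd.DataFrame({
    "ingredient": ["apple", "apple", "pear", "plum"],
    "compound": ["compound_a", "compound_b", "compound_a", None],
})

toy_ingredient_to_compounds = {}

for index, row in toy_flavors.iterrows():
    # setdefault creates the empty set the first time an ingredient is seen
    compounds_of = toy_ingredient_to_compounds.setdefault(row["ingredient"], set())
    # Skip missing values so that NaN never becomes a set element
    if pd.notna(row["compound"]):
        compounds_of.add(row["compound"])

print(toy_ingredient_to_compounds)
```

With this approach an ingredient without compounds maps to an empty set, so it never shares a compound with any other ingredient.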

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to create `ingredients_array` with the list of ingredients, and to create dictionary `ingredient_to_compounds` with a set of compounds for each ingredient</font>

In [77]:
# Sorted array with the names of all ingredients
ingredients_array = np.sort(np.asarray(ingredients['ingredient name']))

# Map each ingredient to the set of compounds it contains
ingredient_to_compounds = dict()
for ingredient_name in ingredients_array:
    # Rows of flavors that belong to this ingredient
    aux_df = flavors.loc[flavors['ingredient'] == ingredient_name]
    ingredient_to_compounds[ingredient_name] = set(aux_df.compound.tolist())
print(ingredient_to_compounds)

{'abies_alba': {'bornyl_acetate'}, 'abies_alba_pine_needle': {'maltol'}, 'abies_balsamea_oil': {'myrcene'}, 'abies_canadensis': {'bornyl_acetate'}, 'abies_concolor': {'bornyl_acetate'}, 'abies_sibirica': {'isoborneol', 'camphene', 'bornyl_acetate'}, 'abies_sibirica_oil': {nan}, 'acacia': {'benzyl_acetate', 'methyl_benzoate', 'l-arabinose', '(e)-2-hexenyl_hexanoate', 'camphene', 'isoeugenol', 'eugenol'}, 'acacia_caven': {'benzyl_alcohol', 'methyl_salicylate'}, 'acacia_dealbata': {'heptanoic_acid'}, 'acacia_farnesiana': {'p-cresol', 'p-methoxybenzaldehyde'}, 'acacia_farnesiana_oil': {'p-methoxybenzaldehyde'}, 'acacia_flower_oil': {'eugenol'}, 'achasma_walang': {'2-dodecenal'}, 'achillea_ageratum': {'butyl_alcohol', 'butyl_anthranilate', 'isoamyl_alcohol'}, 'achillea_micrantha_oil': {'eucalyptol'}, 'achillea_millefolium': {'a-pinene'}, 'aconitum_napellus': {'aconitic_acid'}, 'agaricus': {'benzyl_formate'}, 'agarwood': {'4-(p-methoxyphenyl)-2-butanone'}, 'agastache_formosana': {'pulegone'}

Next, we create a NetworkX graph with nodes representing ingredients and edges of weight `x` connecting two ingredients having `x` flavor compounds in common.

To create an empty graph, do `ingredient_ingredient = nx.Graph()`.

Now, iterate through all pairs of ingredients in `ingredients_array` and compute the compounds they have in common between them. To iterate through all pair combinations of an array X, you can use:

```python
for u, v in itertools.combinations(X,2):
    ...

```

The size of the intersection of two sets `s1`, `s2` can be obtained with `len(s1.intersection(s2))`. To facilitate visualization, we will keep only edges connecting two ingredients having **M or more compounds in common**. Set the value of **M** so that the resulting graph has somewhere around 150 +/- 30 nodes.

To add to graph *G* an edge between nodes *u* and *v* having weight *w*, do `G.add_edge(u, v, weight=w)`.
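
The effect of the threshold **M** can be explored on a toy example before running it on the full data. The sketch below is an illustration only: the helper `build_ingredient_graph` and the dictionary `toy_sets` are hypothetical names with made-up contents.

```python
import itertools
import networkx as nx

# Toy ingredient -> compounds mapping (hypothetical data)
toy_sets = {
    "apple": {"c1", "c2", "c3"},
    "pear": {"c1", "c2"},
    "plum": {"c3"},
    "salt": set(),
}

def build_ingredient_graph(ingredient_to_compounds, M):
    """Connect two ingredients when they share M or more compounds."""
    G = nx.Graph()
    for u, v in itertools.combinations(sorted(ingredient_to_compounds), 2):
        w = len(ingredient_to_compounds[u].intersection(ingredient_to_compounds[v]))
        if w >= M:
            G.add_edge(u, v, weight=w)
    return G

# Try several thresholds and report the size of the resulting graph
for M in (1, 2, 3):
    G = build_ingredient_graph(toy_sets, M)
    print("M=%d: %d nodes, %d edges" % (M, G.number_of_nodes(), G.number_of_edges()))
```

Because nodes are only created by `add_edge`, ingredients without any edge at the chosen threshold do not appear in the graph, which is why raising **M** reduces the number of nodes.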

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to create the `ingredient_ingredient` graph</font>

In [89]:
# Keep only edges between ingredients sharing M or more compounds
M = 81
ingredient_ingredient = nx.Graph()
for u, v in itertools.combinations(ingredients_array, 2):
    x = len(ingredient_to_compounds[u].intersection(ingredient_to_compounds[v]))
    if x >= M:
        ingredient_ingredient.add_edge(u, v, weight=x)


In [90]:
# Leave as-is
print("The ingredient-ingredient graph has %d nodes and %d edges" %
      (ingredient_ingredient.number_of_nodes(), ingredient_ingredient.number_of_edges()))

The ingredient-ingredient graph has 145 nodes and 1277 edges


Save the resulting graph into a file. You can use [write_gml](https://networkx.org/documentation/stable/reference/readwrite/generated/networkx.readwrite.gml.write_gml.html#networkx.readwrite.gml.write_gml) to use the *GML* format.
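
One way to check that the file was written correctly is to read it back with `nx.read_gml`. The following sketch does this round trip on a toy graph; the node names and the file name `toy.gml` are hypothetical, and a temporary directory is used so that nothing is left on disk.

```python
import os
import tempfile
import networkx as nx

# Toy weighted graph (hypothetical nodes and weight)
G = nx.Graph()
G.add_edge("apple", "pear", weight=2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.gml")
    # Write the graph in GML format and read it back
    nx.write_gml(G, path)
    H = nx.read_gml(path)

print(sorted(H.nodes()), H["apple"]["pear"]["weight"])
```

The node names are stored as GML labels and the `weight` attribute is stored on each edge, which is the attribute Cytoscape can later use for the layout and the edge styling.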

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [93]:
OUTPUT_INGR_INGR_FILENAME = 'ingredient-ingredient.gml'

<font size="+1" color="red">Replace this cell with your code to save graph G to file OUTPUT_INGR_INGR_FILENAME</font>

In [94]:
nx.write_gml(ingredient_ingredient, OUTPUT_INGR_INGR_FILENAME)

## 2.2. Work with this file in Cytoscape

### 2.2.1. Inspect this file

*Tip:* Open the ``ingredient-ingredient.gml`` file in a text editor first to see how it is structured.


### 2.2.2. Import this file into Cytoscape

To import this file into Cytoscape:

* `File > Import > Network from file ...`
* Open the `ingredient-ingredient.gml` file

Now we need to import ingredient categories:

* `File > Import > Table from file ...`
* Open the `ingredients.tsv` file
* Import data as "Node Table Columns"
* `ingredient name`: key
* `category`: attribute

Do a `Layout > Edge weighted spring embedded` layout on the *weight* attribute.

### 2.2.3. Style and add simple annotations

Style lines connecting nodes so their thickness and color reflects the number of compounds in common.

Color the nodes with colors representing the class of ingredients. Note that if you right-click on "Mapping type" when creating a discrete mapping, you can use an automatic mapping generator to start with.

Add text annotations (secondary button > add > text annotation) to the first, second, and third largest connected component, with your observations (e.g., "The second largest component is dominated by ingredients of type x"). Place the annotations next to the components they refer to (secondary button > edit > move annotation).
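
To know in advance which components are the first, second, and third largest, they can also be listed in NetworkX. This is a sketch on a toy graph with hypothetical single-letter nodes; on the real data the graph would be `ingredient_ingredient`.

```python
import networkx as nx

# Toy graph with three connected components of sizes 4, 3 and 2
G = nx.Graph()
G.add_edges_from([
    ("a", "b"), ("b", "c"), ("c", "d"),
    ("e", "f"), ("f", "g"),
    ("h", "i"),
])

# connected_components yields one set of nodes per component
components = sorted(nx.connected_components(G), key=len, reverse=True)
sizes = [len(c) for c in components]

# Subgraph with only the largest connected component
largest = G.subgraph(components[0]).copy()

print(sizes, sorted(largest.nodes()))
```

Listing the nodes of each of the three largest components makes it easier to decide which ingredient category dominates each one before writing the annotations.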

Save the main connected component of this graph with its legend as `ingr-ingr.png` using `File > Export > Network to image ...`.

The next cell should display your graph with its legend.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [98]:
# Change width if necessary

display(Image(url="ingr-ingr.png", width=1200))

display(Image(url="ingr-ingr-legend.gif", width=400))

<font size="+1" color="red">Replace this cell by a brief commentary, in your own words, of what you see in this graph. Describe 2-3 ingredients that have interesting connections or lack of connections, plus any other interesting observations that you can make about this graph.</font>

In [None]:
This network consists of multiple connected components. In the smaller components it is
easy to see why the ingredients are connected to each other. In the largest connected components, certain nodes
share more edges with each other (because of homophily), forming subgraphs that are linked to the others by a few interesting edges.
For example, the soy beans node, which is the only node in the "beans" family that has outward connections, is connected to
bantu beer, which in turn has connections to the family of cheese ingredients.

# DELIVER (individually)

Read the section on "delivering your code" in the [course evaluation guidelines](https://github.com/chatox/networks-science-course/blob/master/upf/upf-evaluation.md).

Deliver a zip file containing:

* This notebook
* The ``flavors.tsv`` and ``flavors.png`` files
* The ``ingredient-ingredient.gml`` and ``ingr-ingr.png`` files

## Extra points available

For more learning and extra points, get the `recipes.csv` file. It contains one recipe per line, in this format:

```
EastAsian,roasted_sesame_seed,garlic,cayenne,seaweed,sesame_oil
```

This means there is one East Asian dish whose recipe requires the ingredients "roasted_sesame_seed", "garlic", "cayenne", "seaweed", and "sesame_oil".

Select 5 recipes and draw using Cytoscape one bi-partite graph for each, with the ingredients and their compounds in each recipe. Include those bi-partite graphs here, plus a brief commentary about whether the ingredients used share many compounds, few compounds, or not at all, and any other observations you want to make about the selected recipes.
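
As a starting point, one line of `recipes.csv` can be split into the cuisine and its ingredients, and the ingredient-compound edges of that recipe can then be listed. In the sketch below the recipe line is the example given above, while the dictionary `toy_compounds` is hypothetical: its compound names are placeholders, not real data, and in practice the dictionary `ingredient_to_compounds` would be used instead.

```python
import itertools

line = "EastAsian,roasted_sesame_seed,garlic,cayenne,seaweed,sesame_oil"

# The first field is the cuisine, the remaining fields are the ingredients
cuisine, *recipe_ingredients = line.strip().split(",")

# Hypothetical compound sets (placeholders, not taken from the dataset)
toy_compounds = {
    "roasted_sesame_seed": {"compound_a", "compound_b"},
    "sesame_oil": {"compound_a"},
    "garlic": {"compound_c"},
}

# Edges of the bi-partite graph of this recipe: (ingredient, compound)
edges = [
    (ingredient, compound)
    for ingredient in recipe_ingredients
    for compound in sorted(toy_compounds.get(ingredient, set()))
]

# Number of compounds shared by each pair of ingredients in the recipe
shared = {
    (u, v): len(toy_compounds.get(u, set()) & toy_compounds.get(v, set()))
    for u, v in itertools.combinations(recipe_ingredients, 2)
}

print(cuisine, len(recipe_ingredients), edges)
```

The list `edges` can be written to a tab-separated file and imported into Cytoscape in the same way as `flavors.tsv`, while `shared` gives a quick indication of whether the ingredients of a recipe have many or few compounds in common.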

**Note:** if you go for the extra points, add ``<font size="+2" color="blue">Additional results: ingredients and compounds of five recipes</font>`` at the top of your notebook.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+2" color="#003300">I hereby declare that, except for the code provided by the course instructors, all of my code, report, and figures were produced by myself.</font>
