# Visualising connections between cooking ingredients using graphs

   While building my "cooking-ingredient-pairing-suggestions" <b>Telegram bot</b>, I gathered a large dataset of <i>ingredient + its pairing + number of occurrences</i> (data is from <a href="https://www.allrecipes.com">allrecipes.com</a>).
   Most ingredients appear in the dataset multiple times, forming a variety of pairings with each other. In effect, these foods form a network in which ingredients that are commonly used together are connected. We'll therefore build a graph to visualise this network.

In [31]:
import networkx as nx
import numpy as np
import pandas as pd
import plotly.express as px
from pyvis.network import Network
import community as community_louvain

from database import engine

First, we extract the data from the database.

In [32]:
pairs_db = pd.read_sql(sql='ingredients', con=engine)
pairs_db

Unnamed: 0,id,ingredient,pairing,count
0,1,cream cheese,white sugar,1
1,2,cream cheese,heavy cream,2
2,4,cream cheese,raspberry pie filling,1
3,5,white sugar,cream cheese,1
4,6,white sugar,heavy cream,2
...,...,...,...,...
11338,13933,bread crumbs,pork chop,1
11339,13934,olive oil,pork chop,1
11340,13935,pork chop,egg,1
11341,13936,pork chop,bread crumbs,1


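Because our graph is undirected, the data should not contain duplicates. At the moment each pairing can appear in both orders, for example (cream cheese, white sugar) and (white sugar, cream cheese).
So, in order to create a dataframe representation of the graph, we need to:
1. Sort the two ingredients within each row so that mirrored pairs become identical
2. Remove the duplicated rows
3. Create a new dataframe from this data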

In [33]:
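# Sort the two ingredient names within each row so that (a, b) and (b, a) become identical
sorted_pairs = pd.DataFrame(np.sort(pairs_db[['ingredient', 'pairing']].values, axis=1), columns=["source", "target"])
sorted_pairs["weight"] = pairs_db["count"]
# Drop the mirrored rows; despite the variable name, each remaining row is one edge of the graph
network_nodes = sorted_pairs.drop_duplicates()
# using only the 250 most frequent pairings
network_nodes = network_nodes.sort_values("weight", ascending=False).iloc[:250]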
network_nodes

Unnamed: 0,source,target,weight
1627,butter,sugar,187
1624,flour,sugar,168
1460,egg,sugar,148
67,butter,flour,143
442,butter,egg,108
...,...,...,...
4133,egg,egg yolk,8
2028,sugar,vinegar,8
965,garlic,tomato sauce,8
1083,coconut,milk,7
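
The three deduplication steps can also be checked on a toy example. The snippet below is a minimal stdlib-only sketch, not the notebook's pandas code: the rows are made up in the shape of the `ingredients` table, and `undirected_edges` is a hypothetical helper name. It assumes that mirrored rows carry the same count, as they do in the sample output above.

```python
# Toy rows in the shape (ingredient, pairing, count); hypothetical data, not the real table.
rows = [
    ("cream cheese", "white sugar", 1),
    ("white sugar", "cream cheese", 1),
    ("cream cheese", "heavy cream", 2),
    ("heavy cream", "cream cheese", 2),
]


def undirected_edges(rows):
    """Collapse mirrored (a, b) / (b, a) rows into a single undirected edge."""
    edges = {}
    for ingredient, pairing, count in rows:
        # Sorting the pair gives both orderings the same key.
        key = tuple(sorted((ingredient, pairing)))
        # Keep the first count seen for each pair (assumes mirrored counts agree).
        edges.setdefault(key, count)
    return edges


edges = undirected_edges(rows)
for (source, target), weight in edges.items():
    print(source, target, weight)
```
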


Creating a graph from the dataframe:
- each node is an ingredient
- each edge means that the two ingredients are used together in a recipe
- the weight of each edge is the number of times the two ingredients occur together in a recipe
- the size of each node is determined by its degree (the number of edges connected to the node)

In [34]:
G = nx.from_pandas_edgelist(network_nodes, 
                                source='source', 
                                target='target', 
                                edge_attr='weight',
                                create_using=nx.Graph())

node_degree = dict(G.degree)
nx.set_node_attributes(G, node_degree, 'size')
nx.set_node_attributes(G, '#E9F1BF', 'color')

In [35]:
# Using pyvis to display the graph in the notebook (NB: if the graph renders incorrectly, rerun this cell)
net = Network(notebook=True, width='900px', height='800px',
              bgcolor='#543D61', font_color='white', cdn_resources='remote')

net.from_nx(G)
net.show('ingredient_network.html')

Looking at the graph, we can see that the biggest nodes are <b>onion</b>, <b>sugar</b>, <b>butter</b>, <b>pepper</b>, <b>egg</b> and <b>garlic</b>. Since node size reflects degree, these ingredients pair with the largest number of other ingredients, which suggests that they are the most commonly used in cooking. This agrees with common sense.

To check this hypothesis, let's explore different centrality measures of the graph nodes.
<br><br><br>

## Centrality measures

In [36]:
def display_centrality(centrality_dict, centrality_type):
    """Plot the ten nodes with the highest value of the given centrality measure."""
    centrality_df = (pd.DataFrame.from_dict(centrality_dict, orient="index", columns=["centrality"])
                     .reset_index()
                     .rename(columns={"index": "ingredient"}))
    top10 = centrality_df.sort_values("centrality", ascending=False).iloc[:10, :]
    # highlight the top-ranked ingredient in a darker shade
    fig = px.bar(top10,
                 x="ingredient", y="centrality",
                 color="ingredient", color_discrete_sequence=["#543D61"] + ["#A896B3"] * 9,
                 title=f"Top 10 ingredients by {centrality_type} centrality")
    fig.update_layout(showlegend=False)
    fig.show()

### 1. Degree centrality
\- the number of links connected to each node (NetworkX divides it by the maximum possible degree, n - 1).
<br><br>
Shows the nodes with the largest number of "neighbours".
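
To make the definition concrete, here is a minimal stdlib-only sketch of normalised degree centrality on a toy graph. The adjacency mapping `toy_graph` is made up for illustration and is not taken from the dataset. The function mirrors the normalisation used by `nx.degree_centrality` (degree divided by n - 1), but it is only an illustration, not NetworkX's implementation.

```python
# Toy undirected graph as an adjacency mapping (hypothetical data).
toy_graph = {
    "onion": {"garlic", "pepper", "tomato"},
    "garlic": {"onion", "tomato"},
    "pepper": {"onion"},
    "tomato": {"onion", "garlic"},
}


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(graph)
    return {node: len(neighbours) / (n - 1) for node, neighbours in graph.items()}


centrality = degree_centrality(toy_graph)
# List nodes from most to least connected.
for node, value in sorted(centrality.items(), key=lambda item: item[1], reverse=True):
    print(f"{node}: {value:.2f}")
```
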

In [37]:
degree_dict = nx.degree_centrality(G)
display_centrality(degree_dict, "degree")

Here the top 10 ingredients are the same ones that appear as the largest nodes in the visualisation.
<br><br>

### 2. Betweenness centrality
\- the number of times a node lies on the shortest path between two other nodes.
<br><br>
Shows how often a node can be used as a "bridge" between others.
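
The "bridge" idea can be spelled out on a small example. The sketch below is a deliberately simple, stdlib-only version for unweighted, undirected graphs: for every pair of nodes it counts the share of shortest paths that pass through each other node. The toy chain of ingredients and the helper names are hypothetical. The final scaling by 2 / ((n - 1)(n - 2)) follows the default normalisation of `nx.betweenness_centrality` for undirected graphs; NetworkX itself uses a faster algorithm than this pairwise loop.

```python
from collections import deque
from itertools import combinations

# Toy chain: butter - sugar - lemon - garlic - onion (hypothetical data).
toy_graph = {
    "butter": {"sugar"},
    "sugar": {"butter", "lemon"},
    "lemon": {"sugar", "garlic"},
    "garlic": {"lemon", "onion"},
    "onion": {"garlic"},
}


def bfs_counts(graph, source):
    """Return (distances, shortest-path counts) from source to every reachable node."""
    dist = {source: 0}
    paths = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                paths[nb] = 0
                queue.append(nb)
            if dist[nb] == dist[node] + 1:
                # Every shortest path to node extends to a shortest path to nb.
                paths[nb] += paths[node]
    return dist, paths


def betweenness_centrality(graph):
    nodes = list(graph)
    n = len(nodes)
    info = {s: bfs_counts(graph, s) for s in nodes}
    score = dict.fromkeys(nodes, 0.0)
    for s, t in combinations(nodes, 2):
        dist_s, paths_s = info[s]
        dist_t, paths_t = info[t]
        if t not in dist_s:
            continue  # s and t are not connected
        for v in nodes:
            if v in (s, t) or v not in dist_s or v not in dist_t:
                continue
            # v lies on a shortest s-t path exactly when the distances add up.
            if dist_s[v] + dist_t[v] == dist_s[t]:
                score[v] += paths_s[v] * paths_t[v] / paths_s[t]
    scale = 2 / ((n - 1) * (n - 2)) if n > 2 else 1.0
    return {v: value * scale for v, value in score.items()}


betweenness = betweenness_centrality(toy_graph)
print(betweenness)
```

In this chain the middle node, lemon, scores highest because every path between the two ends has to pass through it.
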

In [38]:
betweenness_dict = nx.betweenness_centrality(G)
display_centrality(betweenness_dict, "betweenness")

Here the top 10 is the same; however, the differences in centrality are more noticeable.
<br><br>

### 3. Closeness centrality
\- based on the sum of the shortest-path distances from a node to all other nodes (the smaller the sum, the higher the closeness).
<br><br>
Shows nodes that are the closest to others (from which the entire network can be accessed most quickly).
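
As an illustration, closeness can be computed by hand with a breadth-first search. This is a minimal stdlib-only sketch that assumes a connected, unweighted graph, in which case the value (n - 1) divided by the sum of distances matches what `nx.closeness_centrality` reports. The toy chain and the helper names are hypothetical.

```python
from collections import deque

# Toy chain: butter - sugar - lemon - garlic - onion (hypothetical data).
toy_graph = {
    "butter": {"sugar"},
    "sugar": {"butter", "lemon"},
    "lemon": {"sugar", "garlic"},
    "garlic": {"lemon", "onion"},
    "onion": {"garlic"},
}


def shortest_distances(graph, source):
    """Breadth-first search: number of hops from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def closeness_centrality(graph):
    """(n - 1) / sum of distances; assumes the graph is connected."""
    n = len(graph)
    result = {}
    for node in graph:
        total = sum(shortest_distances(graph, node).values())
        result[node] = (n - 1) / total if total > 0 else 0.0
    return result


closeness = closeness_centrality(toy_graph)
print(closeness)
```
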

In [39]:
closeness_dict = nx.closeness_centrality(G)
display_centrality(closeness_dict, "closeness")

Here the top 10 is also the same, but the differences in centrality are very subtle. This can be explained by the high density of connections in the graph.
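
Density itself is easy to measure: it is the share of all possible edges that are actually present, 2m / (n(n - 1)) for an undirected graph with n nodes and m edges. The snippet below is a small stdlib-only sketch on made-up toy graphs; for the real graph, `nx.density(G)` returns the same quantity.

```python
def density(graph):
    """Share of possible undirected edges that exist in an adjacency mapping."""
    n = len(graph)
    m = sum(len(neighbours) for neighbours in graph.values()) / 2
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


# Hypothetical toy graphs: a fully connected triangle and a three-node chain.
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}

triangle_density = density(triangle)
chain_density = density(chain)
print(triangle_density, chain_density)
```
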
<br><br>

The centrality measures confirm that our idea of which ingredients are the most popular was right.
<br>
Moreover, it seems that <b>onion</b> is the most central of them all.
<br>
<br>
<br>
<br>

## Ingredient groups (communities)

Another interesting question is whether our data can be divided into groups. Let's build a community graph to find out.

In [40]:
# Community exploration with the Louvain method

groups = community_louvain.best_partition(G)
nx.set_node_attributes(G, groups, 'group')

community_net = Network(notebook=True, width='900px', height='800px', 
                        bgcolor='#543D61', font_color='white', cdn_resources='remote')

node_degree = dict(G.degree)
nx.set_node_attributes(G, node_degree, 'size')

community_net.from_nx(G)
community_net.show('ingredient_groups.html')

So, the community extraction resulted in two large groups, which can be interpreted as follows:
- Group 1 (blue) - ingredients mainly used in desserts and baking ("sweet"), such as butter, flour, sugar, egg.

- Group 2 (yellow) - ingredients mainly used in main dishes ("savory"), such as onion, cheese, garlic, tomato.

- Connector nodes between the groups - ingredients used in both domains, such as lemon, vegetable oil, sour cream.
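
The Louvain method looks for a partition with high modularity, a score that compares the number of edges inside each group with what would be expected if edges were placed at random. The snippet below is a minimal stdlib-only sketch of that score for an unweighted graph. The toy edges, group labels, and the `modularity` helper are hypothetical, and unlike the notebook's graph this version ignores edge weights.

```python
def modularity(edges, partition):
    """Unweighted modularity: sum over groups of L_c / m - (d_c / 2m) ** 2."""
    m = len(edges)
    internal = {}    # L_c: number of edges with both ends in group c
    degree_sum = {}  # d_c: total degree of the nodes in group c
    for u, v in edges:
        degree_sum[partition[u]] = degree_sum.get(partition[u], 0) + 1
        degree_sum[partition[v]] = degree_sum.get(partition[v], 0) + 1
        if partition[u] == partition[v]:
            internal[partition[u]] = internal.get(partition[u], 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())


# Two tight triangles joined by a single "connector" edge (hypothetical data).
edges = [
    ("butter", "flour"), ("flour", "sugar"), ("butter", "sugar"),
    ("onion", "garlic"), ("garlic", "tomato"), ("onion", "tomato"),
    ("sugar", "tomato"),
]
sweet_savory = {"butter": 0, "flour": 0, "sugar": 0,
                "onion": 1, "garlic": 1, "tomato": 1}
single_group = dict.fromkeys(sweet_savory, 0)

q_split = modularity(edges, sweet_savory)
q_single = modularity(edges, single_group)
print(q_split, q_single)
```

Splitting the toy graph into its two triangles scores higher than putting every node in one group, which is the kind of improvement the Louvain method searches for.
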