# Recipe Co-occurrence Network Analysis

## Project Overview

This project focuses on analyzing culinary data to construct and explore a network of recipe similarities. Recipes are connected based on the number of ingredients they share, allowing for a quantitative understanding of ingredient combinations and relationships between different dishes.

The core idea is to transform raw recipe-ingredient data into a network graph, enabling the application of network science techniques to uncover hidden structures and patterns in culinary systems.

---

## 0. Necessary Imports

This cell imports every Python library used by the functions and the main script. Placing the imports at the top of the notebook ensures all dependencies are loaded before any other cell runs. The cell also sets `file_path`, the location of the recipe-ingredient CSV file, and creates an empty `nx.Graph` named `G`.

In [22]:
import pandas as pd
import numpy as np
import time
from scipy.sparse import csr_matrix, triu, coo_matrix
import networkx as nx

file_path = './culinaryDB/04_Recipe-Ingredients_Aliases.csv'
G = nx.Graph()

---

## 1. `sample_recipes_df` Function

This function is designed for **recipe data sampling**. It's not always necessary or efficient to work with the entire dataset, especially if it's very large. This function allows you to select a specific percentage of unique recipes from your original dataset.

### What it Does:
* Takes the complete DataFrame and the desired percentage of recipes to sample.
* Identifies all unique recipes in the DataFrame.
* Randomly shuffles the list of unique recipes.
* Selects the specified percentage of these shuffled recipes.
* Filters the original DataFrame to include only the rows related to the sampled recipes. This reduces the size of the dataset we will work with, making subsequent calculations faster.
* Prints information about the number of original and sampled recipes to track the effect of sampling.
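
The sampling steps above depend on NumPy's global random state, so two runs generally keep different recipes. A minimal sketch of a reproducible variant is shown below. The helper `sample_recipe_ids`, its `seed` parameter, and the toy DataFrame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def sample_recipe_ids(df, percentage, seed=0):
    """Return a reproducible random subset of the unique 'Recipe ID' values (hypothetical helper)."""
    rng = np.random.default_rng(seed)  # local generator, leaves the global NumPy state untouched
    unique_ids = df['Recipe ID'].unique()
    sample_size = int(len(unique_ids) * percentage)
    return rng.choice(unique_ids, size=sample_size, replace=False)

# Toy data: 5 recipes, 8 (recipe, ingredient) rows
toy = pd.DataFrame({
    'Recipe ID': [1, 1, 2, 3, 3, 4, 5, 5],
    'Entity ID': [10, 11, 10, 12, 13, 11, 10, 14],
})

kept_ids = sample_recipe_ids(toy, 0.4, seed=42)     # keeps 2 of the 5 recipes
toy_sampled = toy[toy['Recipe ID'].isin(kept_ids)]  # same row filter as sample_recipes_df
print(sorted(kept_ids), len(toy_sampled))
```

Calling the helper again with the same `seed` returns the same recipe IDs, which makes downstream results easier to compare between runs.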


In [23]:
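def sample_recipes_df(df, percentage):
    """
    Returns the rows of `df` that belong to a random subset of unique recipes.
    `percentage` is a fraction between 0 and 1 (e.g. 0.4 keeps 40% of the recipes).
    """
    print(f"\nApplying sampling: Keeping {percentage*100:.0f}% of unique recipes...")
    unique_recipes = df['Recipe ID'].unique()
    num_total_recipes = len(unique_recipes)
    # Shuffle in place; the selection differs between runs unless a NumPy seed is set beforehand.
    np.random.shuffle(unique_recipes)
    sample_size = int(num_total_recipes * percentage)
    sampled_recipe_ids = unique_recipes[:sample_size]
    # Keep every ingredient row of the sampled recipes.
    df_sampled = df[df['Recipe ID'].isin(sampled_recipe_ids)].copy()
    print(f"Original unique recipes: {num_total_recipes}")
    print(f"Sampled unique recipes: {len(sampled_recipe_ids)}")
    print(f"DataFrame filtered to {len(df_sampled)} rows for sampled recipes.")
    return df_sampled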

---

## 2. `build_cooccurrence_matrix` Function

This is the core function for creating the **co-occurrence matrix**, which is the heart of the analysis. The matrix records how many ingredients each pair of recipes has in common. It relies on sparse matrices so that it can handle large datasets.

### What it Does:
* **Data Loading:** Reads the specified CSV file, loading only the 'Recipe ID' and 'Entity ID' columns, which are essential for matrix construction.
* **Sampling (Optional):** If a `sample_percentage` less than 1.0 is specified, it calls the `sample_recipes_df` function to reduce the dataset.
* **De-duplication:** A **crucial** step, enabled by default through `drop_duplicate_ingredients_per_recipe=True`. It removes duplicate rows based on the combination of 'Recipe ID' and 'Entity ID'. If a recipe lists the same ingredient more than once (for example, basil appearing twice), that ingredient is counted only once for that recipe. This prevents an artificial inflation of co-occurrence weights.
* **ID Mapping:** Converts textual/numerical recipe and ingredient IDs into sequential numerical indices (0, 1, 2, ...) required for matrix operations. It also creates a mapping to retrieve original IDs later.
* **Bipartite Matrix:** Constructs a sparse matrix `B` (Recipes x Ingredients) where cell (i, j) is 1 if recipe `i` contains ingredient `j`. With de-duplication enabled, each cell contains at most 1; if it is disabled, repeated (recipe, ingredient) pairs are summed and a cell can exceed 1.
* **Co-occurrence Matrix Calculation:** Calculates the co-occurrence matrix by multiplying the bipartite matrix by its transpose (`B @ B.T`). The value `C[i,j]` in this matrix represents the number of common ingredients between recipe `i` and recipe `j`.
* **Diagonal Cleaning:** Sets the elements on the diagonal to zero (`setdiag(0)`). This is because a recipe co-occurs with itself for all its ingredients, and this value is not significant for similarity analysis.
* **Explicit Zero Elimination:** Calls `eliminate_zeros()` to drop the zeros that `setdiag(0)` leaves explicitly stored in the sparse matrix. This keeps `nnz` equal to the number of real connections and keeps the data structure consistent, which can help avoid errors such as `IndexError` in later steps.
* **Thresholding (Optional):** If a `threshold_weight` is specified (> 0), it filters the matrix to keep only connections (co-occurrences) whose weight is strictly greater than this threshold. This helps to remove weak and insignificant connections. Internally, it temporarily converts to COO format for efficient filtering and then reconstructs the matrix.
* **Upper Triangle Extraction:** Extracts only the upper part of the matrix (`triu(..., k=1)`). Since the co-occurrence matrix is symmetric and we are building an undirected graph, we only need the upper half to represent all unique connections, saving memory and processing time.
* **Return Values:** Returns the final co-occurrence matrix (filtered and upper triangular), the recipe ID-to-index map, and the NumPy array of unique (sampled) recipe IDs.
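
The matrix steps above can be checked by hand on a toy example. The sketch below uses three made-up recipes and four made-up ingredients (not taken from the dataset) and a threshold of 1 purely for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, triu

# Toy bipartite data: recipe 0 = {0, 1, 2}, recipe 1 = {1, 2}, recipe 2 = {0, 1, 3}
rows = np.array([0, 0, 0, 1, 1, 2, 2, 2])  # recipe indices
cols = np.array([0, 1, 2, 1, 2, 0, 1, 3])  # ingredient indices
B = csr_matrix((np.ones(len(rows), dtype=int), (rows, cols)), shape=(3, 4))

# C[i, j] = number of ingredients shared by recipes i and j
C = B @ B.T
C.setdiag(0)         # a recipe trivially shares all ingredients with itself
C.eliminate_zeros()  # drop the zeros now stored on the diagonal

# Keep only pairs sharing strictly more than 1 ingredient
coo = C.tocoo()
mask = coo.data > 1
C_thr = csr_matrix((coo.data[mask], (coo.row[mask], coo.col[mask])), shape=C.shape)

# The matrix is symmetric, so the upper triangle holds every pair once
upper = triu(C_thr, k=1, format='csr')
print(C.toarray())
print(upper.toarray())
```

Recipes 0 and 1 share two ingredients, as do recipes 0 and 2, so both pairs survive the threshold. Recipes 1 and 2 share only one ingredient and are dropped.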


In [24]:
def build_cooccurrence_matrix(file_path, sample_percentage=1.0, threshold_weight=0, drop_duplicate_ingredients_per_recipe=True):
    """
    Builds a sparse co-occurrence matrix for recipes based on shared ingredients.
    Includes optional sampling, de-duplication and weight thresholding.
    """
    
    print(f"Loading data from {file_path}...")
    df = pd.read_csv(file_path, usecols=['Recipe ID', 'Entity ID'])

    if sample_percentage < 1.0:
        df = sample_recipes_df(df, sample_percentage)

    if drop_duplicate_ingredients_per_recipe:
        initial_rows = len(df)
        df.drop_duplicates(subset=['Recipe ID', 'Entity ID'], inplace=True)
        rows_after_dedup = len(df)
        if initial_rows > rows_after_dedup:
            print(f"Removed {initial_rows - rows_after_dedup} duplicate (Recipe ID, Entity ID) pairs.")
        else:
            print("No duplicate (Recipe ID, Entity ID) pairs found.")
    else:
        print("Skipping deduplication of ingredients per recipe as requested.")


    print("Mapping recipe and ingredient IDs to indices...")
    unique_recipe_ids = df['Recipe ID'].unique()
    recipe_to_index = {recipe_id: i for i, recipe_id in enumerate(unique_recipe_ids)}
    num_recipes = len(unique_recipe_ids)

    unique_ingredient_ids = df['Entity ID'].unique()
    ingredient_to_index = {ingredient_id: i for i, ingredient_id in enumerate(unique_ingredient_ids)}
    num_ingredients = len(unique_ingredient_ids)

    print(f"Found {num_recipes} unique recipes and {num_ingredients} unique ingredients.")

    df['recipe_idx'] = df['Recipe ID'].map(recipe_to_index)
    df['ingredient_idx'] = df['Entity ID'].map(ingredient_to_index)

    print("Creating the sparse bipartite matrix (Recipes x Ingredients)...")
    start_time = time.time()
    
    recipe_ingredient_matrix = csr_matrix(
        (np.ones(len(df), dtype=int), (df['recipe_idx'], df['ingredient_idx'])),
        shape=(num_recipes, num_ingredients)
    )

    print(f"Sparse bipartite matrix created in {time.time() - start_time:.2f} seconds.")

    print("Calculating the co-occurrence matrix via matrix multiplication (B @ B.T)...")
    start_time = time.time()
    cooccurrence_matrix = recipe_ingredient_matrix @ recipe_ingredient_matrix.T
    print(f"Co-occurrence matrix calculated in {time.time() - start_time:.2f} seconds.")

    cooccurrence_matrix.setdiag(0)
    cooccurrence_matrix.eliminate_zeros() 

    if threshold_weight > 0:
        print(f"Applying a weight threshold of > {threshold_weight} to the co-occurrence matrix...")
        
        coo_matrix_temp = cooccurrence_matrix.tocoo()
        
        rows = coo_matrix_temp.row
        cols = coo_matrix_temp.col
        data = coo_matrix_temp.data
        
        mask = data > threshold_weight
        
        filtered_rows = rows[mask]
        filtered_cols = cols[mask]
        filtered_data = data[mask]
        
        cooccurrence_matrix = coo_matrix((filtered_data, (filtered_rows, filtered_cols)), 
                                                  shape=cooccurrence_matrix.shape).tocsr()
        print(f"Number of non-zero entries after thresholding: {cooccurrence_matrix.nnz}")

    print("Extracting upper triangle of the co-occurrence matrix for graph creation efficiency...")
    cooccurrence_matrix_final = triu(cooccurrence_matrix, k=1, format='csr') 

    print("\nMatrix construction completed!")
    print(f"Original (pre-triangle extraction) co-occurrence matrix dimensions: {cooccurrence_matrix.shape}")
    print(f"Final sparse matrix (upper triangular) non-zero entries: {cooccurrence_matrix_final.nnz}")

    return cooccurrence_matrix_final, recipe_to_index, unique_recipe_ids

---

## 3. `create_network_from_cooccurrence_matrix` Function

This function takes the co-occurrence matrix (created previously) and transforms it into a **network graph** using the `NetworkX` library. A graph is an intuitive representation of relationships, where nodes are recipes and edges represent their similarity (based on shared ingredients).

### What it Does:
* Takes the `cooccurrence_matrix` (which should be a sparse matrix) and an `index_to_original_id_map` (to convert numerical node indices into readable recipe IDs) as input.
* Uses a NetworkX built-in constructor to create the graph directly from the matrix data: `nx.from_scipy_sparse_array` for a `csr_matrix`, or `nx.from_numpy_array` for a dense `np.ndarray`.
* If the matrix is of any other type, it falls back to a manual (slower) method that adds nodes and edges one at a time.
* Once a graph with numerical nodes (indices) is created, it **relabels** these nodes using the `index_to_original_id_map`, replacing indices with the original recipe IDs. This makes the graph easier to interpret.
* Prints information about the process status, including execution times and graph dimensions.
* Prints one example edge (the first edge returned by the graph's edge iterator) and its weight to verify the graph structure.
* Returns the complete and relabeled NetworkX graph.
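
The conversion and relabeling steps can be illustrated on a tiny matrix. This is a minimal sketch: the 3x3 matrix and the IDs `R101`, `R205`, `R309` are invented for the example, and `nx.from_scipy_sparse_array` is assumed to be available in the installed NetworkX version.

```python
import networkx as nx
import numpy as np
from scipy.sparse import csr_matrix

# Upper-triangular toy co-occurrence matrix for three recipes
upper = csr_matrix(np.array([[0, 2, 2],
                             [0, 0, 1],
                             [0, 0, 0]]))

# Hypothetical index -> original recipe ID map
index_to_id = {0: 'R101', 1: 'R205', 2: 'R309'}

# Nodes are 0..2 at first; each stored entry becomes a weighted edge
G_idx = nx.from_scipy_sparse_array(upper, create_using=nx.Graph, edge_attribute='weight')

# Replace integer indices with the original IDs (returns a new graph)
G_toy = nx.relabel_nodes(G_idx, index_to_id)

print(G_toy.number_of_nodes(), G_toy.number_of_edges())
print(G_toy['R101']['R205']['weight'])
```

Every row of the matrix becomes a node, including recipes that have no stored entries, so the node count always equals the matrix dimension.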

In [25]:
def create_network_from_cooccurrence_matrix(cooccurrence_matrix, index_to_original_id_map):
    """
    Transforms a weighted co-occurrence matrix into an undirected NetworkX graph.
    """
    
    print("Starting graph creation process...")
    start_total_time = time.time()

    num_recipes = cooccurrence_matrix.shape[0]

    print(f"Total unique recipes identified: {num_recipes}")
    print("Attempting to create NetworkX graph from the co-occurrence matrix...")

    graph_creation_start_time = time.time()

    if isinstance(cooccurrence_matrix, csr_matrix):
        G_indexed = nx.from_scipy_sparse_array(cooccurrence_matrix, create_using=nx.Graph, edge_attribute='weight')
    elif isinstance(cooccurrence_matrix, np.ndarray):
        G_indexed = nx.from_numpy_array(cooccurrence_matrix, create_using=nx.Graph, edge_attribute='weight')
    else:
        print("Warning: cooccurrence_matrix is not a scipy.sparse.csr_matrix or numpy.ndarray. Falling back to manual edge addition.")
        G_indexed = nx.Graph()
        for i in range(num_recipes):
            G_indexed.add_node(i)

        if hasattr(cooccurrence_matrix, 'nonzero'):
            rows, cols = cooccurrence_matrix.nonzero()
            for r_idx, c_idx in zip(rows, cols):
                if r_idx < c_idx:
                    weight = cooccurrence_matrix[r_idx, c_idx]
                    if weight > 0:
                        G_indexed.add_edge(r_idx, c_idx, weight=weight)
        else:
            for i in range(num_recipes):
                for j in range(i + 1, num_recipes):
                    weight = cooccurrence_matrix[i, j]
                    if weight > 0:
                        G_indexed.add_edge(i, j, weight=weight)

    print(f"Initial graph with integer indices created in {time.time() - graph_creation_start_time:.2f} seconds.")
    print(f"Graph has {G_indexed.number_of_nodes()} nodes and {G_indexed.number_of_edges()} edges.")

    print("Relabeling nodes from integer indices to original recipe IDs...")
    relabel_start_time = time.time()
    
    mapping_for_relabel = {idx: index_to_original_id_map[idx] for idx in G_indexed.nodes()}

    G = nx.relabel_nodes(G_indexed, mapping_for_relabel)
    print(f"Nodes relabeled in {time.time() - relabel_start_time:.2f} seconds.")

    print(f"\nGraph creation completed. Total time: {time.time() - start_total_time:.2f} seconds.")
    print(f"Final graph has {G.number_of_nodes()} nodes and {G.number_of_edges()} edges.")

    print("Example of an edge and its weight:")
    if G.number_of_edges() > 0:
        sample_edge = next(iter(G.edges(data=True)))
        u, v, data = sample_edge
        print(f"Edge between recipe {u} and recipe {v} with weight {data['weight']}")
    else:
        print("No edges found in the graph (check data or weight threshold if applied).")

    return G

---

## 3. Executing the `build_cooccurrence_matrix` Function

This block calls `build_cooccurrence_matrix` with the chosen parameters and then passes the result to `create_network_from_cooccurrence_matrix`. It is the entry point of the analysis pipeline.

### What it Does:
* Uses the `file_path` defined in the imports cell.
* Calls `build_cooccurrence_matrix`, passing parameters for sampling (`sample_percentage=0.40`, meaning 40% of recipes) and the weight threshold (`threshold_weight=7`). The threshold of 7 means that two recipes are connected only if they share **more than 7 ingredients**, i.e. at least 8.
* Saves the results returned by the function into the variables:
    * `cooc_matrix_filtered`: The resulting co-occurrence matrix.
    * `recipe_map_filtered`: The dictionary mapping recipe IDs to indices.
    * `ids_filtered`: The NumPy array of sampled recipe IDs.
* Prints the dimensions of the final matrix and the number of non-zero entries (NNZ), which represent the remaining significant connections.
* Inverts `recipe_map_filtered` into `index_to_recipe_id_map` (index to recipe ID) and passes it, together with the matrix, to `create_network_from_cooccurrence_matrix`, which returns the graph `recipe_network`.
* Prints the number of nodes and edges of the resulting graph.

In [26]:
cooc_matrix_filtered, recipe_map_filtered, ids_filtered = build_cooccurrence_matrix(file_path, sample_percentage=0.40, threshold_weight=7)
print(f"Filtered matrix shape: {cooc_matrix_filtered.shape}, NNZ: {cooc_matrix_filtered.nnz}")

index_to_recipe_id_map = {idx: recipe_id for recipe_id, idx in recipe_map_filtered.items()}

recipe_network = create_network_from_cooccurrence_matrix(cooc_matrix_filtered, index_to_recipe_id_map)

print("\nStarting graph analysis:")
print(f"Number of nodes: {recipe_network.number_of_nodes()}")
print(f"Number of edges: {recipe_network.number_of_edges()}")

Loading data from ./culinaryDB/04_Recipe-Ingredients_Aliases.csv...

Applying sampling: Keeping 40% of unique recipes...
Original unique recipes: 45749
Sampled unique recipes: 18299
DataFrame filtered to 182950 rows for sampled recipes.
Removed 17902 duplicate (Recipe ID, Entity ID) pairs.
Mapping recipe and ingredient IDs to indices...
Found 18299 unique recipes and 635 unique ingredients.
Creating the sparse bipartite matrix (Recipes x Ingredients)...
Sparse bipartite matrix created in 0.00 seconds.
Calculating the co-occurrence matrix via matrix multiplication (B @ B.T)...
Co-occurrence matrix calculated in 2.26 seconds.
Applying a weight threshold of > 7 to the co-occurrence matrix...
Number of non-zero entries after thresholding: 453956
Extracting upper triangle of the co-occurrence matrix for graph creation efficiency...

Matrix construction completed!
Original (pre-triangle extraction) co-occurrence matrix dimensions: (18299, 18299)
Final sparse matrix (upper triangular) non-zer

---

## 4. Saving the Graph to a `.net` File

In [27]:
if recipe_network.number_of_nodes() > 0: # Only save if there are nodes
    nx.write_pajek(recipe_network, "nets/recipe_similarity_network.net")
    print("\nGraph saved in Pajek NET format: recipe_similarity_network.net")
else:
    print("\nGraph not saved: no nodes in graph to save.")


Graph saved in Pajek NET format: recipe_similarity_network.net
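
A saved Pajek file can be read back to check what was written. The sketch below is a round trip on a toy graph in a temporary directory; the graph, the file name `toy.net`, and the directory handling are illustrative assumptions rather than part of the original notebook.

```python
import os
import tempfile
import networkx as nx

# Toy graph with integer recipe IDs and integer weights
G_small = nx.Graph()
G_small.add_edge(101, 205, weight=8)
G_small.add_edge(101, 309, weight=9)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "nets")
    os.makedirs(out_dir, exist_ok=True)  # the target directory must exist before writing
    path = os.path.join(out_dir, "toy.net")
    nx.write_pajek(G_small, path)
    # read_pajek returns a multigraph, so convert it back to a simple Graph
    G_back = nx.Graph(nx.read_pajek(path))

print(sorted(G_back.nodes()))
print(G_back['101']['205']['weight'])
```

After the round trip, node labels come back as strings and weights as floats, so any code that compares them with the original integer recipe IDs needs to convert first.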


---

## 5. Basic Analysis

In [28]:
# Example of some basic analyses:
# Average degree
if recipe_network.number_of_nodes() > 0:
    average_degree = sum(dict(recipe_network.degree()).values()) / recipe_network.number_of_nodes()
    print(f"Average degree: {average_degree:.2f}")
else:
    print("Cannot calculate average degree: no nodes in graph.")

# Number of connected components
num_connected_components = nx.number_connected_components(recipe_network)
print(f"Number of connected components: {num_connected_components}")

# If you want to calculate centrality (e.g., degree centrality)
if recipe_network.number_of_nodes() > 0: # Check to avoid error on empty graph
    degree_centrality = nx.degree_centrality(recipe_network)
    print("\nTop 5 recipes by degree centrality:")
    for recipe, centrality in sorted(degree_centrality.items(), key=lambda item: item[1], reverse=True)[:5]:
        print(f"Recipe {recipe}: {centrality:.4f}")
else:
    print("Cannot calculate centrality: no nodes in graph.")

# If you want to calculate the clustering coefficient
if recipe_network.number_of_nodes() > 1: # Clustering requires at least 2 nodes
    clustering_coefficient = nx.average_clustering(recipe_network, weight='weight')
    print(f"Average clustering coefficient (weighted): {clustering_coefficient:.4f}")
else:
    print("Cannot calculate clustering coefficient: not enough nodes in graph.")

Average degree: 24.81
Number of connected components: 10800

Top 5 recipes by degree centrality:
Recipe 33233: 0.0837
Recipe 27240: 0.0827
Recipe 26693: 0.0730
Recipe 25328: 0.0691
Recipe 29087: 0.0670
Average clustering coefficient (weighted): 0.0898
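
The output above reports 10800 connected components, and one possible reading is that many recipes have no neighbor above the weight threshold. That reading is not verified here. A minimal sketch of how the same metrics behave on a toy graph with isolated nodes (all names and values are made up) could look like this:

```python
import networkx as nx

# One triangle of connected recipes plus two isolated recipes
G_toy = nx.Graph()
G_toy.add_weighted_edges_from([('a', 'b', 8), ('b', 'c', 9), ('a', 'c', 10)])
G_toy.add_nodes_from(['d', 'e'])

# Average degree equals 2 * edges / nodes, the same value as the sum-of-degrees formula
avg_degree = 2 * G_toy.number_of_edges() / G_toy.number_of_nodes()

# Each isolated node counts as its own connected component
n_components = nx.number_connected_components(G_toy)
n_isolated = nx.number_of_isolates(G_toy)
largest_component = max(nx.connected_components(G_toy), key=len)

print(f"Average degree: {avg_degree:.2f}")
print(f"Components: {n_components}, isolated nodes: {n_isolated}")
print(f"Largest component size: {len(largest_component)}")
```

Running the same calls on `recipe_network` would show how many of its components are single isolated recipes and how large the main component is.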
