# Introduction
The provided text describes a segment of a code repository belonging to an integrated omics project, authored by a specific group. The purpose of this code is to process a collection of 936 JSON ingredient files, seeking optimal ingredient substitutions across various recipes, such as replacing eggs in bakery recipes. The procedural steps of this process are outlined as follows:

1. **Initial Data Extraction**: The code begins by extracting crucial attributes from the 936 JSON ingredient files. These extracted attributes are then consolidated into a single JSON file.

2. **Graph Creation**: The extracted JSON file serves as the basis for generating a graph. In this graph, ingredients function as nodes, while edges are derived from a selected attribute, such as the count of shared molecules between two ingredients. The resulting graph is subsequently visualized through plotting.

3. **Graph Embeddings**: Employing the Node2Vec algorithm, graph embeddings are generated for all ingredients. These embeddings capture the relationships between ingredients within the graph. To quantify ingredient similarities, cosine similarity and Euclidean distance are computed between the embeddings.

4. **Similar Ingredient Identification**: The code identifies ingredients that exhibit the highest similarity to a specified target ingredient.

5. **Dimensionality Reduction**: A dimensionality reduction technique is applied to the embeddings. Specifically, t-SNE is used to reduce them to two dimensions. This facilitates a clearer understanding of ingredient relationships, which are re-evaluated in this reduced space.
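
For reference, the two measures named in step 3 have the following standard definitions for two embedding vectors $u, v \in \mathbb{R}^d$ (these are textbook formulas, not taken from the notebook):

$$
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert},
\qquad
d(u, v) = \lVert u - v \rVert_2 = \sqrt{\sum_{i=1}^{d} (u_i - v_i)^2}
$$

A larger cosine similarity and a smaller Euclidean distance both indicate more similar ingredients, which is why the notebook sorts the first in descending and the second in ascending order.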

### How organized and maintainable this code is:
After examining the code, it's obvious that it lacks organization and isn't easy to maintain. To address these concerns, it requires significant changes to introduce structure and improve its maintainability. One major issue is the absence of a clear structure, and another problem is the complete lack of functions. This leads to repeated blocks of code appearing throughout the codebase. Additionally, there are various other significant and minor issues that need attention.

In the following part, I review the code and point out its strong and weak points. Each piece of code contains some internal comments; where necessary, I added extra comments outside the code cells under the title 'General points'. I also use the title 'Author notation' for the documentation provided by the author.

Thus, let's begin the analysis.

### Author notation
Integration of 936 JSON ingredient files downloaded from Flavour DB into a unified file named "integrated_data". Extract the attribute "entity_alias_readable" representing the ingredient and its sub-attribute "molecules". Within "molecules", extract the attributes "flavor_profile", "fooddb_flavor_profile", and "common_name" representing the molecule name, taste, and flavor information.

### General points:

1. The author provides a description of the following piece of code, which is quite informative. Consequently, I get a sense of the code before reading it.

2. The description of the first piece of code is a little confusing, especially this part: 'Within "molecules", extract the attributes "flavor_profile", "fooddb_flavor_profile", and "common_name" representing the molecule name, taste, and flavor information.' The attributes and their descriptions are not listed in matching order, so the reader has to infer which description belongs to which attribute from the attribute names alone.

In [3]:
import os
import json
import networkx as nx
import matplotlib.pyplot as plt

#=========================================================
# Specify the folder path containing the JSON files
folder_path = "C:/Users/ghaza/Downloads/ingrediants"
"""
In my opinion, one of the best ways to supply a path is through a config
file, since it avoids hardcoding. In this code, every time she wants to
change the path she has to edit the main body of the code.
"""
#=========================================================

#=========================================================
# Create a dictionary to store the integrated data
integrated_data = []
"""
The comment above says she is about to create a dictionary, but she
created a list. A comment should match what the code actually does, and
I believe a data scientist should not make this kind of mistake.
"""
#=========================================================

#=========================================================
#####
# 1 #
#####
# Check if the output file already exists
output_file_path = "C:/Users/ghaza/Downloads/integrated_data.json"
if os.path.exists(output_file_path):
    os.remove(output_file_path)
#####
# 1 #
#####
"""
Sections '1' and '2' can be combined into a function that builds and
returns a list (or even a dictionary) of all the JSON file paths. That list
can then be used as the input of the next part or function of the code.
There are three reasons for this. First, it breaks this complex structure
into a group of simpler building blocks. Second, each function or building
block performs a single task; for example, one function is responsible for
building the paths, and another for reading the configuration file. Third,
the code acquires a structure, even an architecture, which makes it much
more understandable and more functional.
"""
#####
# 2 #
#####
# Iterate over each file in the folder
for filename in os.listdir(folder_path):
    if filename.endswith(".json"):
        file_path = os.path.join(folder_path, filename)
#####
# 2 #
#####
#=========================================================

#=========================================================
#####
# 3 #
#####
        # Read the JSON file
        with open(file_path, "r") as file:
            file_data = json.load(file)
            
            ingredient = file_data.get("entity_alias_readable", "")
            molecules = file_data.get("molecules", [])
            category = file_data.get("category_readable", "")

            # Iterate over molecules and extract relevant data
            for molecule in molecules:
                molecule_info = {
                    "flavor": molecule.get("flavor_profile", ""),
                    "molecule": molecule.get("common_name", ""),
                    "fooddb_flavor_profile": molecule.get("fooddb_flavor_profile", ""),
                    "taste": molecule.get("taste", "")
                    
                }
                ingredient_data = {
                    "ingredients": ingredient,
                    "category":[category],
                    "molecules": [molecule_info]
                    
                }
                """
                This part is good. She kept the code simple and practical and
                extracted the features she wants using only the get method.
                In my opinion, the only part that can be optimized is the
                'molecules' for loop. She could have used a list comprehension
                instead, which is usually faster and removes the need for an
                explicit loop nested inside the main loop.
                """
#=========================================================

#=========================================================
#####
# 4 #
#####
                # Check if ingredient already exists in integrated_data
                existing_ingredient = next((item for item in integrated_data if item["ingredients"] == ingredient), None)

                # If ingredient already exists, append molecule to existing ingredient
                if existing_ingredient:
                    existing_ingredient["molecules"].append(molecule_info)
                else:
                    integrated_data.append(ingredient_data)
                
                """
                This part can be improved by using a dictionary. If 'integrated_data' were a
                dictionary keyed by ingredient name instead of a list, she could find an ingredient
                with a simple membership test in an if clause, without needing the next function.
                Moreover, the line that calls next is more than 100 characters long, so breaking
                it across several lines would help it comply with PEP 8.
                """
"""
Parts '3' and '4' can be combined into a function named, for example, 'making_integrated_data'.
This would modularize the code further and limit each function's responsibility to one task.
"""
#=========================================================

#=========================================================
#####
# 5 #
#####

# Write the integrated data into the output file
with open(output_file_path, "w") as output_file:
    json.dump(integrated_data, output_file, indent=4)

print("Integrated JSON file created successfully.")
"""
This part is good. She saved 'integrated_data' as a JSON file, so she
keeps only the parts of the original JSON files that she needs and
stores them in a separate file, which is quite good.
"""
"""
Part '5' can be a single function whose job is to save the integrated
data as a JSON file at the output path.
"""
#=========================================================


Integrated JSON file created successfully.


### General points:
1. This piece of code has no structure or architecture. It needs thorough modularization and rearrangement.
2. This piece of code is an example of hardcoding. Anyone who wants to use it under other conditions needs to change part of the main body.
3. It does not use a config file.
4. A strong point is that the variable names describe what they contain.
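
Points 2 and 3 could be addressed with a small configuration file. The sketch below is only an illustration: the file name `config.json`, its two keys, and the `load_config` helper are assumptions made for this example and are not part of the reviewed notebook.

```python
import json
from pathlib import Path


def load_config(config_path):
    """Read the input folder and output file from a JSON config file.

    The key names below are assumed for this example only.
    """
    with open(config_path, "r", encoding="utf-8") as handle:
        raw = json.load(handle)
    return {
        "input_folder": Path(raw["input_folder"]),
        "output_file": Path(raw["output_file"]),
    }
```

With a `config.json` that holds both paths, `load_config("config.json")["input_folder"]` would replace the hard-coded `folder_path`, so changing the location no longer requires editing the main body.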

One example structure is as follows:

In [None]:
# example structure

def config():
    pass

def making_file_paths_list():
    pass

def making_integrated_data():
    pass

def saving_file():
    pass
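
As a rough illustration of how such a skeleton could be filled in, the sketch below implements three of the functions. It keys the integrated data by ingredient name, so an existing ingredient is found with a plain dictionary lookup instead of `next`. The function bodies and parameter names are assumptions, not the author's code.

```python
import json
from pathlib import Path


def making_file_paths_list(folder):
    """Return the sorted paths of all JSON files in `folder`."""
    return sorted(Path(folder).glob("*.json"))


def making_integrated_data(file_paths):
    """Merge ingredient files into one dict keyed by ingredient name."""
    integrated = {}
    for path in file_paths:
        with open(path, "r", encoding="utf-8") as handle:
            file_data = json.load(handle)
        name = file_data.get("entity_alias_readable", "")
        # List comprehension instead of an explicit inner loop.
        molecules = [
            {
                "flavor": m.get("flavor_profile", ""),
                "molecule": m.get("common_name", ""),
                "fooddb_flavor_profile": m.get("fooddb_flavor_profile", ""),
                "taste": m.get("taste", ""),
            }
            for m in file_data.get("molecules", [])
        ]
        # setdefault creates the entry the first time a name is seen
        # and returns the existing entry afterwards.
        entry = integrated.setdefault(
            name,
            {
                "ingredients": name,
                "category": [file_data.get("category_readable", "")],
                "molecules": [],
            },
        )
        entry["molecules"].extend(molecules)
    return integrated


def saving_file(data, output_path, indent=4):
    """Write `data` as JSON to `output_path`, overwriting an existing file."""
    # Mode "w" truncates an existing file, so no prior os.remove is needed.
    with open(output_path, "w", encoding="utf-8") as handle:
        json.dump(data, handle, indent=indent)
```

Calling `saving_file(list(integrated.values()), output_path)` would reproduce the list layout of the original output file. One difference to be aware of: unlike the original loop, this sketch also records an ingredient that has no molecules.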

Make a list of ingredient names

In [None]:

import json

# Read the file
with open('C:/Users/ghaza/Downloads/integrated_data.json') as file:
    data = json.load(file)

# Extract the ingredient names
ingredient_names = [item['ingredients'] for item in data]

# Print the ingredient names
ingredient_names
"""
First, she imported json twice. This again stems from the lack of order in
this code. In a Jupyter notebook, the best way to manage imports is to
create a separate cell and put all of them inside that cell.
Second, it would be better to write a function that extracts some of the
features of this dataset. That way she could present some of the key
features of the dataset visually to herself and her colleagues.
Third, she used a list comprehension, which is quite nice.
"""

In [5]:
file_path = 'C:/Users/ghaza/Downloads/ingredient_names.txt'

# Create a dictionary with the 'ingredients' key and the ingredient names list as the value
data =  ingredient_names

# Save the data to the file as JSON
with open(file_path, 'w') as file:
    json.dump(data, file)

"""
This is just four lines of code, but it contains several serious problems:
First, the path is again written in the main body instead of a config file.
Second, the comments here are more misleading than helpful for understanding the code.
She wrote that she is creating a dictionary, but all she did was assign a new name to
the list built in the previous code cell. So, when she saved 'data' as JSON, she simply
saved that list.
Third, she wrote the same piece of code twice (the part that saves the JSON file).
If she had written a 'saving_file' function, she could simply call it here.
"""

### Author notation
The resulting graph visually represents the relationships between ingredients, with edges indicating the presence of shared
molecules and the weight (number of shared molecules) displayed as labels on the edges.

### General points
A very good clarification of the task that the following piece of code performs.

In [None]:
#=========================================================
import json
import networkx as nx
import matplotlib.pyplot as plt
import pickle
"""
Again, modules are imported in the middle of the program. Apart from pickle, all of
them were imported earlier. The json module has now been imported THREE times.
"""
#=========================================================

#=========================================================
# Read the JSON file
with open('C:/Users/ghaza/Downloads/integrated_data.json') as file:
    data = json.load(file)

# Extract ingredient names, molecules, categories, and colors
ingredients_data = data
ingredients = []
category_colors = {}  # Dictionary to store category colors
color_index = 0  # Counter for assigning colors
""" 
I cannot understand why she did this, since she could have assigned the loaded JSON
directly to 'ingredients_data'. Moreover, 'ingredients' should be named 'ingredients_list'
to give a sense of its type. The same applies to 'category_colors', which should
become 'category_colors_dict'.
"""
#=========================================================

#=========================================================
for ingredient in ingredients_data:
    ingredient_dict = {
        'name': ingredient['ingredients'],
        'molecules': [],
        'category': ingredient['category'],
    }

    for molecule in ingredient['molecules']:
        ingredient_dict['molecules'].append(molecule['molecule'])
    ingredients.append(ingredient_dict)
    """
    This part is largely redundant. She wanted to read the elements of the JSON file that
    she created in the previous cell and store them in a list. Instead of building a new
    dictionary for each ingredient and appending it to a list, something she already did
    before, she should have used a list comprehension. She keeps the overall structure;
    the only things she changes are renaming the key 'ingredients' to 'name' and reducing
    each molecule to its name. The same outcome can be produced as follows:

    ingredients_list = [
        {'name': item['ingredients'],
         'molecules': [m['molecule'] for m in item['molecules']],
         'category': item['category']}
        for item in ingredients_data
    ]
    """
#=========================================================

#=========================================================
    # Convert the category to a tuple if it's a list
    category = ingredient['category']
    if isinstance(category, list):
        category = tuple(category)

    # Check if the category already has a color assigned
    if category not in category_colors:
        # Assign a new color to the category
        category_colors[category] = f'C{color_index}'
        color_index += 1
    """
    'category' is always a list here, because she stored it as a list in the ingredient dictionary:
    "category":[category],
    so the isinstance check is unnecessary. One option is to convert it to a tuple once, when the
    ingredient dictionary is built, so there is no need for a separate check. The rest of this part
    is good: she built a category-to-color dictionary for the next step.
    """
#=========================================================

#=========================================================
# Create an empty graph
graph = nx.Graph()

# Iterate over ingredient pairs
for i in range(len(ingredients)):
    for j in range(i + 1, len(ingredients)):
        ing1 = ingredients[i]
        ing2 = ingredients[j]

        # Check if ing1 and ing2 share a molecule
        shared_molecules = set(ing1['molecules']).intersection(ing2['molecules'])
        if shared_molecules:
            # Add an edge between ing1 and ing2 with the weight of the number of shared molecules
            weight = len(shared_molecules)
            graph.add_edge(ing1['name'], ing2['name'], weight=weight)

            # Assign the category to the nodes
            graph.nodes[ing1['name']]['category'] = category
            graph.nodes[ing2['name']]['category'] = category

# Save the graph using Pickle
with open('graph_shared_molecules_weights.pkl', 'wb') as file:
    pickle.dump(graph, file)
"""
This is an adequate implementation of building a graph, but there are some problems with it.
First, she used a nested for loop here, which is not actually necessary. Instead, she could
have used the itertools package:
for i, j in itertools.combinations(range(len(ingredients)), 2):
This gives the same result without writing two for loops.

Second, the category part is wrong. If you look at the variable 'category', you can see
that it holds the category of the last ingredient processed in the previous loop. Consequently,
this code assigns the same category to all the nodes of the graph.

Third, she again could have used a function to save the graph, but this part of the code does
not have any structure.

Fourth, she could have written a function such as 'making_graph' to build the graph for her.

Fifth, she built the list 'ingredients' above only to loop over all the ingredients. Instead, she
could loop over the pairs of entries directly and get the same result, using the following loop:

for i, j in itertools.combinations(ingredients, 2):
    print(i, j)

With this, most of the redundant computations and operations can be removed.
"""
#=========================================================

#=========================================================
# Draw the graph with edge labels, category, and color nodes
plt.figure(figsize=(100, 80))
pos = nx.spring_layout(graph)
node_colors = [category_colors[graph.nodes[node]['category']] for node in graph.nodes()]
nx.draw(graph, pos, with_labels=True, node_size=500, node_color=node_colors, edge_color='gray')
labels = nx.get_edge_attributes(graph, 'weight')
nx.draw_networkx_edge_labels(graph, pos, edge_labels=labels)

# Draw category nodes
for category, color in category_colors.items():
    plt.text(0, 0, str(category), color=color, ha='center', fontsize=8)

plt.show()
"""
Here she drew the graph using matplotlib.
First, she could have written a function such as 'drawing_graph' to draw this graph, which
could also be reused in other parts of the code if necessary.

Second, as I expected, all the nodes have the same color. The problem is rooted in assigning
the same category to all the nodes (as you can see, all the nodes are brown).

Third, the category dictionary (category_colors) that she built for the node colors is
effectively unused: every node looks up the same key, so only one of its colors ever appears.

Fourth, the node_colors list she built contains the same element for every node. The reason
is that all the nodes' categories are the same.
"""
#=========================================================


### General points:
1. Again, this piece of code lacks structure, which I believe makes it hard for the writer to notice some of the severe problems it has.
2. Part of the code is redundant. The writer could have implemented all of these operations with about one third fewer lines of code.
3. The code is hard-coded; thus, even a trivial change requires the writer to edit the main body of the code.
4. The code contains logical errors, such as assigning the wrong category to the nodes, that make the outcome invalid.
5. The variable names are not sufficient. One cannot tell the type of a variable from its name, which makes things more complicated for a reader. It also makes things more complicated for the writer, especially if she comes back to the code after a while.
6. The code is unnecessarily complex.

Moreover, the above code can be organized in the following functions:

In [None]:
def making_graph():
    pass

def drawing_graph():
    pass
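
A possible core for `making_graph` is sketched below. To keep the example free of third-party dependencies it returns plain Python structures; the helper name `shared_molecule_edges` is an assumption. It iterates over pairs with `itertools.combinations` and takes each node's category from that node's own ingredient dictionary, which avoids the category bug discussed above.

```python
from itertools import combinations


def shared_molecule_edges(ingredients):
    """Return (edges, categories) for ingredients that share molecules.

    `ingredients` is a list of dicts with 'name', 'molecules' (a list of
    molecule names) and 'category' (a list), as built in the reviewed cell.
    """
    edges = []
    categories = {}
    for ing1, ing2 in combinations(ingredients, 2):
        shared = set(ing1["molecules"]) & set(ing2["molecules"])
        if not shared:
            continue
        # The edge weight is the number of shared molecules.
        edges.append((ing1["name"], ing2["name"], len(shared)))
        # Each node keeps its own category, not a leftover loop variable.
        categories[ing1["name"]] = tuple(ing1["category"])
        categories[ing2["name"]] = tuple(ing2["category"])
    return edges, categories
```

With networkx, the result could then be loaded through `graph.add_weighted_edges_from(edges)` followed by `nx.set_node_attributes(graph, categories, "category")`.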

In [None]:
#=========================================================
from node2vec import Node2Vec
"""
Again she imported another module in the middle of the code.
Admittedly, this can be a matter of personal style, since one may want to show where
the module is used in the code. However, I personally do NOT like this practice.
First, PEP 8 recommends placing imports at the top of a file, which matters once the
notebook is converted to a .py file. Moreover, I believe it is messy to import
a module in the middle of the code.
"""
#=========================================================

#=========================================================
# Generate random walks using Node2Vec
# Generate random walks using Node2Vec with tuned parameters
node2vec = Node2Vec(graph, dimensions=128, walk_length=80, num_walks=300)
model = node2vec.fit(window=10, min_count=1)

"""
She used Node2Vec, which is a great choice for this problem, since the package
is designed to learn node embeddings from graph data. It is based on word2vec and
aims to learn embeddings such that nodes with similar network neighborhoods
end up having similar embeddings. Consequently, nodes connected by stronger edges are
more likely to belong to the same community and to sit closer together in the embedding space.
These embeddings can be used for various downstream tasks, such as node classification,
link prediction, and graph visualization.

The problem with her model is that she describes its hyperparameters as
TUNED; however, I cannot find any approach, such as GridSearchCV, used to tune
them. Moreover, when I checked the package's GitHub page, I found that two of the
three parameters are in fact the default values ('dimensions', 'walk_length'). Consequently, I believe
that no tuning method was used here.
link: https://github.com/eliorc/node2vec 
She could have built a dictionary of multiple models using a grid search, and then evaluated these
models downstream.
"""
#=========================================================

#=========================================================
# Retrieve node embeddings for available ingredients
ingredient_embeddings = {}
for ingredient_dict in ingredients:
    ingredient_name = ingredient_dict['name']
    if ingredient_name in model.wv:
        embedding = model.wv[ingredient_name]
        ingredient_embeddings[ingredient_name] = embedding
    else:
        continue
"""
In this part she builds a dictionary of ingredient_name: embedding.
While organizing all the embeddings in a dictionary is a good approach,
she could have used a dictionary comprehension to build it. Moreover, the name 'ingredient_embeddings'
does not indicate that it is a dictionary. The dictionary comprehension can be:
ingred_embed_dict = {ingredient_dict['name']:model.wv[ingredient_dict['name']]
                     for ingredient_dict in ingredients if ingredient_dict['name'] in model.wv}
"""
#=========================================================

### General Points
1. The model was not tuned.
2. The model could be wrapped in a function or even a class (which is far beyond the scope of this code), which would make it easier to use.
3. A function that combines Node2Vec with a grid search such as GridSearchCV would be the best fit here.
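
To my knowledge, Node2Vec does not follow the scikit-learn estimator interface that GridSearchCV expects, so a hand-written grid is one simple alternative. The sketch below is a generic illustration: `parameter_grid`, `grid_search`, and the `fit` and `score` callables are hypothetical names, and the scoring criterion is left entirely to the caller.

```python
from itertools import product


def parameter_grid(space):
    """Yield one dict per combination of the values in `space`."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


def grid_search(space, fit, score):
    """Return (best_score, best_params) over all combinations.

    `fit(params)` builds a model and `score(model)` rates it, where a
    higher score is better. Both are supplied by the caller.
    """
    best = None
    for params in parameter_grid(space):
        result = score(fit(params))
        if best is None or result > best[0]:
            best = (result, params)
    return best
```

For this notebook, `fit` could be something like `lambda p: Node2Vec(graph, **p).fit(window=10, min_count=1)` with a search space over `dimensions`, `walk_length`, and `num_walks`, while `score` would have to measure a downstream task chosen by the author.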

In [None]:
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
#=========================================================
# Calculate different similarity matrices
cosine_similarity_matrix = cosine_similarity(list(ingredient_embeddings.values()))
euclidean_distance_matrix = euclidean_distances(list(ingredient_embeddings.values()))

# Example: Find top 5 ingredients similar to "Egg" using different similarity measures
target_ingredient = "Egg"
target_embedding = ingredient_embeddings[target_ingredient]
target_index = list(ingredient_embeddings.keys()).index(target_ingredient)

# Cosine similarity
similar_indices_cosine = cosine_similarity_matrix[target_index].argsort()[::-1][1:30]
similar_ingredients_cosine = [list(ingredient_embeddings.keys())[i] for i in similar_indices_cosine]

# Euclidean distance
similar_indices_euclidean = euclidean_distance_matrix[target_index].argsort()[1:30]
similar_ingredients_euclidean = [list(ingredient_embeddings.keys())[i] for i in similar_indices_euclidean]

print("Ingredients similar to", target_ingredient, "using different similarity measures:")
print("Cosine similarity:")
for ingredient in similar_ingredients_cosine:
    category = next((item['category'] for item in ingredients if item['name'] == ingredient), None)
    if category and category[0] not in ['Meat', 'Seafood','Fish','Fungus']:
        print(f"Ingredient: {ingredient}, Category: {category}")

print("Euclidean distance:")
for ingredient in similar_ingredients_euclidean:
    category = next((item['category'] for item in ingredients if item['name'] == ingredient), None)
    if category and category[0] not in ['Meat', 'Seafood','Fish','Fungus']:
        print(f"Ingredient: {ingredient}, Category: {category}")

"""
In my opinion, this piece of code is by far the best part of this notebook in terms of
readability and efficiency. However, there are some points that could improve it:

First, she could have put all of this in a function whose inputs are the ingredient embeddings
dictionary, the target ingredient, the number of top matches to return, and the list of categories
that the ingredients must not belong to, such as 'Meat'. Its output would be the list of
ingredients and their categories. One could make the function even more flexible by passing the
name of the similarity metric as an input.

Second, the code has only a few comments and needs more, especially in the last part.
While the number of comments in the first part is sufficient, one of them is misleading:
she wrote 'Find top 5 ingredients similar to "Egg" using different similarity measures'; however,
the slice [1:30] actually retrieves the 29 most similar ingredients, not 5.

Third, the code is hard-coded. Moving this piece of code into a function would make it more robust.
"""
#=========================================================
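
The function described in the first point could look roughly like the sketch below. It is written with the standard library only so that it stands alone; the name `most_similar`, its parameters, and the plain dictionaries it expects are assumptions. Vectors are assumed to be non-zero, since cosine similarity is undefined otherwise.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def most_similar(embeddings, categories, target, top_n=5,
                 excluded=(), metric="cosine"):
    """Return the `top_n` (name, category) pairs closest to `target`.

    `embeddings` maps name -> vector and `categories` maps name -> category.
    """
    reference = embeddings[target]
    if metric == "cosine":
        # Higher similarity is better, so negate it for an ascending sort.
        def key(name):
            return -cosine_similarity(reference, embeddings[name])
    elif metric == "euclidean":
        def key(name):
            return euclidean_distance(reference, embeddings[name])
    else:
        raise ValueError(f"unknown metric: {metric}")
    candidates = [
        name for name in embeddings
        if name != target and categories.get(name) not in excluded
    ]
    ranked = sorted(candidates, key=key)[:top_n]
    return [(name, categories.get(name)) for name in ranked]
```

Note that this sketch filters out the excluded categories before taking the top matches, whereas the reviewed cell takes the top matches first and filters afterwards, so the two can return different numbers of results.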




In [None]:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

#=========================================================
# Perform dimensionality reduction on ingredient embeddings
embeddings = np.array(list(ingredient_embeddings.values()))
ingredient_names = list(ingredient_embeddings.keys())

# Use t-SNE for dimensionality reduction
tsne = TSNE(n_components=2)
reduced_embeddings = tsne.fit_transform(embeddings)
"""
The first part of this piece of code is good. She built a dimensionality reduction
model, in this case t-SNE.
"""
#=========================================================

#=========================================================
# Get unique categories
categories = set()
for ingredient in ingredients:
    categories.add(ingredient['category'][0])

# Assign colors to categories
category_colors = {}
color_index = 0
for category in categories:
    category_colors[category] = f'C{color_index}'
    color_index += 1

"""
This part contains repetition:
'category_colors' was already built in a previous cell, so instead of building it again
she should have reused the previous one. Also, there is no need to build 'categories',
since the distinct categories are already the keys of the earlier 'category_colors'
(as one-element tuples); she could have used that dictionary. As a result, this part is
redundant.
However, if the writer wants to keep this part, she should use set and dictionary
comprehensions.
"""
#=========================================================

#=========================================================
# Plot the reduced embeddings without ingredient names and color nodes based on categories
plt.figure(figsize=(10, 10))
for i in range(len(ingredient_names)):
    category = next((ingredient['category'][0] for ingredient in ingredients if ingredient['name'] == ingredient_names[i]), None)
    if category:
        plt.scatter(reduced_embeddings[i, 0], reduced_embeddings[i, 1], color=category_colors[category])

plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")
plt.title("Ingredient Embeddings Visualization")
plt.show()
"""
This part of the code can be optimized with the following tips:
First, organize this piece of code in a function. This is important for the maintainability and
versatility of the code.

Second, using a for loop to look up each category and plotting point by point can be time-consuming and
computationally expensive, especially with a large dataset. Instead, one can use a dictionary
comprehension to first map ingredient names to their categories, then build a list of colors from the
ingredient categories, and finally pass them to plt.scatter to draw the scatter plot. Likewise, instead
of passing a single point's x and y values on each iteration, one can build the lists of x and y
components and draw them all at once.
"""
#=========================================================
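
The second tip could be sketched as follows. The helper name `scatter_inputs` is an assumption; it only prepares the inputs, so that the plot itself becomes a single call.

```python
def scatter_inputs(names, points, ingredients, category_colors):
    """Return (xs, ys, colors) for one vectorized scatter call.

    `points[i]` is the 2-D coordinate of `names[i]`. Ingredients without a
    known category are skipped, mirroring the original `if category:` check.
    """
    # Map each ingredient name to its first category once, up front.
    name_to_category = {ing["name"]: ing["category"][0] for ing in ingredients}
    xs, ys, colors = [], [], []
    for name, (x, y) in zip(names, points):
        category = name_to_category.get(name)
        if category:
            xs.append(x)
            ys.append(y)
            colors.append(category_colors[category])
    return xs, ys, colors
```

In the notebook this would be used as `xs, ys, colors = scatter_inputs(ingredient_names, reduced_embeddings, ingredients, category_colors)` followed by `plt.scatter(xs, ys, c=colors)`. If an ingredient name occurs more than once, the dictionary keeps its last category, while the original `next` lookup returns the first.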


### Author notation
The resulting graph visually represents the relationships between ingredients, with edges indicating the presence of shared flavors and the weight (number of shared flavors) displayed as labels on the edges.

### General points:
The pieces of code above are all the material that the author presented in this notebook. The rest of the code repeats them, each time changing one precondition. For example, in the next repetition, the author used 'fooddb_flavor_profile' as the main feature of each ingredient and tried to find shared 'fooddb flavor' values between ingredients instead of shared molecules.

The author had to change only two parts in each repetition: first, one item in each ingredient dictionary (the 'molecules' item), and second, the part that creates the graph edges, because the condition for an edge between two nodes has to be based on the new feature. Both changes could be implemented conveniently if this code had any kind of structure. If the author had modularized her code, she could have passed these changes as inputs to functions instead of changing the main body each time.

Moreover, she copied each piece of code every time she wanted to examine a new feature, which makes the code even more hard-coded. Consequently, the most important and urgent correction for this code, beyond all the points mentioned above, is to build a flexible and versatile structure for it.

In the following parts, I only mark the repeated parts and the modified sections.
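
To illustrate the point about passing the changing feature as an input, the sketch below takes the feature extractor as a parameter. All function names are assumptions. The flavor-based weight is a simplified definition (the number of flavor terms the two ingredients have in common), not the notebook's exact computation.

```python
from itertools import combinations


def molecule_names(ingredient):
    """Feature: the set of molecule names of an ingredient."""
    return {m["molecule"] for m in ingredient["molecules"]}


def fooddb_flavors(ingredient):
    """Feature: the set of FoodDB flavor terms, split on '@' as in the notebook."""
    flavors = set()
    for m in ingredient["molecules"]:
        # Skip empty strings produced by splitting an empty profile.
        flavors.update(f for f in m["fooddb_flavor_profile"].split("@") if f)
    return flavors


def weighted_edges(ingredients, feature):
    """Return (name1, name2, weight) for pairs sharing a feature value.

    `ingredients` uses the layout of integrated_data.json and `feature`
    is any function that maps one ingredient to a set.
    """
    # Compute each ingredient's feature set once, not once per pair.
    features = [(ing["ingredients"], feature(ing)) for ing in ingredients]
    return [
        (name1, name2, len(set1 & set2))
        for (name1, set1), (name2, set2) in combinations(features, 2)
        if set1 & set2
    ]
```

Switching from molecules to flavors then means calling `weighted_edges(data, fooddb_flavors)` instead of `weighted_edges(data, molecule_names)`, with no copy of the graph-building code.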

In [None]:
import json
import networkx as nx
import matplotlib.pyplot as plt
import pickle
#=========================================================
# Read the JSON file
with open('C:/Users/ghaza/Downloads/integrated_data.json') as file:
    data = json.load(file)

# Extract ingredient names, molecules, and categories
ingredients_data = data
ingredients = []
for ingredient in ingredients_data:
    ingredient_dict = {
        'name': ingredient['ingredients'],
        'molecules': {},
        'category': ingredient['category']
    }

    """
    the same code
    """
#=========================================================

#=========================================================
    for molecule in ingredient['molecules']:
        ingredient_dict['molecules'][molecule['molecule']] = molecule['fooddb_flavor_profile']
    ingredients.append(ingredient_dict)
"""
changing one feature
"""
#=========================================================

#=========================================================
# Create an empty graph
graph = nx.Graph()

# Iterate over ingredient pairs
for i in range(len(ingredients)):
    for j in range(i + 1, len(ingredients)):
        ing1 = ingredients[i]
        ing2 = ingredients[j]

        # Find shared molecules
        shared_molecules = set(ing1['molecules'].keys()) & set(ing2['molecules'].keys())

        # Process shared molecules and flavors
        shared_flavors = []
        for molecule in shared_molecules:
            flavors = ing1['molecules'][molecule].split("@")
            shared_flavors.extend(flavor for flavor in flavors)
        weight = len(set(shared_flavors))

        # Add an edge with the weight between ing1 and ing2
        if weight > 0:
            graph.add_edge(ing1['name'], ing2['name'], weight=weight)

"""
The author built a graph, found that it was wrong, and did not even remove this part
from the body of her code; she just ran the program.
"""
#=========================================================

#=========================================================
# Create a mapping of categories to colors
category_colors = {}
color_index = 0
for ingredient in ingredients:
    category = ingredient['category']
    if isinstance(category, list):
        category = tuple(category)
    if category not in category_colors:
        category_colors[category] = f'C{color_index}'
        color_index += 1
"""
the same code
"""
#=========================================================

#=========================================================
# Assign category colors to nodes
node_colors = [category_colors[tuple(ingredient['category'])] for ingredient in ingredients]

# Remove duplicate ingredients
unique_ingredients = []
ingredient_names = set()  # Keep track of ingredient names
for ingredient in ingredients:
    name = ingredient['name']
    if name not in ingredient_names:
        unique_ingredients.append(ingredient)
        ingredient_names.add(name)

# Use unique_ingredients list for further processing and graph creation
ingredients = unique_ingredients
node_names = [ingredient['name'] for ingredient in ingredients]  # Extract the unique ingredient names

# Recompute node_colors so they line up with the deduplicated ingredient list
node_colors = [category_colors[tuple(ingredient['category'])] for ingredient in ingredients]
"""
This part is different from the others; however, she only used it in this specific repetition.
I am pretty sure that she could have found the unique ingredients in other, more practical ways.
"""
#=========================================================

#=========================================================
# Create a new graph with unique ingredients
graph = nx.Graph()

# Iterate over ingredient pairs
for i in range(len(ingredients)):
    for j in range(i + 1, len(ingredients)):
        ing1 = ingredients[i]
        ing2 = ingredients[j]

        # Find shared molecules
        shared_molecules = set(ing1['molecules'].keys()) & set(ing2['molecules'].keys())

        # Process shared molecules and flavors
        shared_flavors = []
        for molecule in shared_molecules:
            flavors = ing1['molecules'][molecule].split("@")
            shared_flavors.extend(flavor for flavor in flavors)
        weight = len(set(shared_flavors))

        # Add an edge with the weight between ing1 and ing2
        if weight > 0:
            graph.add_edge(ing1['name'], ing2['name'], weight=weight)

"""
All parts of the graph construction are the same; the only parts that differ are the
condition for having an edge between two nodes and the weight of the edges.
"""
#=========================================================

#=========================================================
# Save the graph using Pickle
with open('graph_shared_flavors_weights.pkl', 'wb') as file:
    pickle.dump(graph, file)

# Draw the graph
plt.figure(figsize=(100, 80))
pos = nx.spring_layout(graph, seed=42)  # Set a fixed seed for consistent layout
weights = nx.get_edge_attributes(graph, 'weight')

# Convert node_colors to a list of colors corresponding to node_names
node_colors = [node_colors[node_names.index(node)] for node in graph.nodes]

# Draw nodes with correct colors
nx.draw_networkx_nodes(graph, pos, node_color=node_colors, node_size=2000, cmap='rainbow')
nx.draw_networkx_edges(graph, pos)
nx.draw_networkx_labels(graph, pos, font_size=20)

# Draw edge labels
nx.draw_networkx_edge_labels(graph, pos, edge_labels=weights, font_size=100)

# Draw category nodes
for category, color in category_colors.items():
    plt.text(0, 0, str(category), color=color, ha='center', fontsize=8)

plt.axis('off')
plt.show()

"""
The same functionality, but she used different code to create the graph, which is
absolutely unnecessary.
"""
#=========================================================




In [None]:
from node2vec import Node2Vec
#=========================================================
# Generate random walks using Node2Vec with tuned parameters
node2vec = Node2Vec(graph, dimensions=128, walk_length=80, num_walks=300)
model = node2vec.fit(window=10, min_count=1)


# Retrieve node embeddings for available ingredients
ingredient_embeddings = {}
for ingredient_dict in ingredients:
    ingredient_name = ingredient_dict['name']
    if ingredient_name in model.wv:
        embedding = model.wv[ingredient_name]
        ingredient_embeddings[ingredient_name] = embedding
    else:
        continue
"""
the same code
"""
#=========================================================

In [None]:
#=========================================================
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# Calculate different similarity matrices
cosine_similarity_matrix = cosine_similarity(list(ingredient_embeddings.values()))
euclidean_distance_matrix = euclidean_distances(list(ingredient_embeddings.values()))

# Example: Find the 29 ingredients most similar to "Egg" using different similarity measures
target_ingredient = "Egg"
target_embedding = ingredient_embeddings[target_ingredient]
target_index = list(ingredient_embeddings.keys()).index(target_ingredient)

# Cosine similarity
similar_indices_cosine = cosine_similarity_matrix[target_index].argsort()[::-1][1:30]
similar_ingredients_cosine = [list(ingredient_embeddings.keys())[i] for i in similar_indices_cosine]

# Euclidean distance
similar_indices_euclidean = euclidean_distance_matrix[target_index].argsort()[1:30]
similar_ingredients_euclidean = [list(ingredient_embeddings.keys())[i] for i in similar_indices_euclidean]

print("Ingredients similar to", target_ingredient, "using different similarity measures:")
print("Cosine similarity:")
for ingredient in similar_ingredients_cosine:
    category = next((item['category'] for item in ingredients if item['name'] == ingredient), None)
    if category and category[0] not in ['Meat', 'Seafood','Fish','Fungus']:
        print(f"Ingredient: {ingredient}, Category: {category}")

print("Euclidean distance:")
for ingredient in similar_ingredients_euclidean:
    category = next((item['category'] for item in ingredients if item['name'] == ingredient), None)
    if category and category[0] not in ['Meat', 'Seafood','Fish','Fungus']:
        print(f"Ingredient: {ingredient}, Category: {category}")
"""
the same code
"""
#=========================================================
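
The similarity cell above is repeated unchanged for every graph variant, and its comment says "top 5" while the slice `[1:30]` keeps 29 entries. A small function with an explicit `top_n` parameter would remove both problems. The sketch below is only an illustration: `cosine` and `most_similar` are hypothetical names, the vectors are made up, and it uses plain Python instead of the scikit-learn calls in the quoted cell. The quoted code filters on category, whereas this sketch excludes by name to stay short.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(embeddings, target, top_n=5, exclude=()):
    """Return up to `top_n` (name, score) pairs ranked by cosine similarity.

    `embeddings` maps ingredient name -> vector; names in `exclude` are skipped.
    """
    target_vec = embeddings[target]
    scored = [
        (name, cosine(target_vec, vec))
        for name, vec in embeddings.items()
        if name != target and name not in exclude
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Made-up 2-D vectors, only to exercise the function.
toy = {
    "Egg": [1.0, 0.0],
    "Milk": [0.9, 0.1],
    "Beef": [0.8, 0.6],
    "Lemon": [0.0, 1.0],
}
ranking = most_similar(toy, "Egg", top_n=2, exclude={"Beef"})
```

Because the target is removed by name rather than by skipping index 0, the result does not depend on the target being ranked first against itself.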

In [None]:
#=========================================================
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Perform dimensionality reduction on ingredient embeddings
embeddings = np.array(list(ingredient_embeddings.values()))
ingredient_names = list(ingredient_embeddings.keys())

# Use t-SNE for dimensionality reduction
tsne = TSNE(n_components=2)
reduced_embeddings = tsne.fit_transform(embeddings)

# Get unique categories
categories = set()
for ingredient in ingredients:
    categories.add(ingredient['category'][0])

# Assign colors to categories
category_colors = {}
color_index = 0
for category in categories:
    category_colors[category] = f'C{color_index}'
    color_index += 1

# Plot the reduced embeddings without ingredient names and color nodes based on categories
plt.figure(figsize=(10, 10))
for i in range(len(ingredient_names)):
    category = next((ingredient['category'][0] for ingredient in ingredients if ingredient['name'] == ingredient_names[i]), None)
    if category:
        plt.scatter(reduced_embeddings[i, 0], reduced_embeddings[i, 1], color=category_colors[category])

plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")
plt.title("Ingredient Embeddings Visualization")
plt.show()
"""
the same code
"""
#=========================================================
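
The category-to-color mapping is rebuilt in almost every cell, sometimes from a list of ingredients and sometimes from a `set` of categories. A single helper would keep the behavior consistent. This is a sketch under assumptions: `category_key` and `assign_colors` are hypothetical names, and the categories are made up. It keeps the `isinstance` check from the quoted code, because calling `tuple()` on a plain string splits it into characters.

```python
def category_key(category):
    """Make a category hashable: lists become tuples, other values pass through."""
    return tuple(category) if isinstance(category, list) else category


def assign_colors(categories):
    """Map each distinct category to a Matplotlib cycle color name ('C0', 'C1', ...).

    Colors are handed out in first-seen order, so the same input order
    always yields the same mapping.
    """
    colors = {}
    for category in categories:
        key = category_key(category)
        if key not in colors:
            colors[key] = f"C{len(colors)}"
    return colors


# Made-up categories; one is a plain string to show it is not split.
palette = assign_colors([["Dairy"], ["Fruit"], ["Dairy"], "Spice"])
```

Assigning colors in first-seen order avoids depending on the iteration order of a `set` of strings, which can differ between Python runs.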

### Author notation
In this modified code, each ingredient is connected to the top 10 ingredients that have the most shared flavors with it. The resulting graph will reflect these connections. The top 10 ingredients with the most shared flavors are connected to the current ingredient in the graph.
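
The top-10 rule described in this notation can be written once, with the number of neighbors as a parameter. The sketch below is an illustration only: `top_k_edges` is a hypothetical name, the data is made up, and it returns plain edge tuples that could then be passed to networkx's `Graph.add_edges_from`.

```python
def top_k_edges(features, k=10):
    """Return undirected edges linking each item to its k best-overlapping peers.

    `features` maps ingredient name -> iterable of feature values.
    """
    sets = {name: set(values) for name, values in features.items()}
    edges = set()
    for name, own in sets.items():
        overlaps = [
            (len(own & other), peer)
            for peer, other in sets.items()
            if peer != name
        ]
        # Highest overlap first; ties broken alphabetically for repeatability.
        overlaps.sort(key=lambda pair: (-pair[0], pair[1]))
        for _, peer in overlaps[:k]:
            # Sort the pair so (a, b) and (b, a) count as the same edge.
            edges.add(tuple(sorted((name, peer))))
    return edges


# Made-up flavor lists, only to exercise the function.
toy = {
    "Egg": ["sweet", "fatty", "sulfurous"],
    "Milk": ["sweet", "fatty"],
    "Butter": ["fatty"],
    "Lemon": ["sour"],
}
edges = top_k_edges(toy, k=1)
```

Two properties of the rule show up even on this toy input. Like the quoted cell, the sketch links an ingredient to its best peers even when the overlap is zero, so Lemon is connected although it shares nothing. And because the graph is undirected, Egg ends up with two neighbors even though `k=1`.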

In [None]:
import json
import networkx as nx
import matplotlib.pyplot as plt
import pickle  # used below to save the graph
#=========================================================
# Read the JSON file
with open('C:/Users/ghaza/Downloads/integrated_data.json') as file:
    data = json.load(file)

# Extract ingredient names, flavors, and categories
ingredients_data = data
ingredients = []
category_colors = {}
color_index = 0

for ingredient in ingredients_data:
    ingredient_dict = {
        'name': ingredient['ingredients'],
        'flavors': [],
        'category': ingredient['category']
    }
    """
    the same code
    """
#=========================================================

#=========================================================
    for molecule in ingredient['molecules']:
        ingredient_dict['flavors'].extend(molecule['flavor'].split('@'))
    ingredients.append(ingredient_dict)
    """
    changing one feature
    """
#=========================================================

#=========================================================
    # Store the category-color mapping
    category = ingredient['category']
    if isinstance(category, list):
        category = tuple(category)
    if category not in category_colors:
        category_colors[category] = f'C{color_index}'
        color_index += 1
"""
the same code
"""
#=========================================================

#=========================================================
# Create an empty graph
graph = nx.Graph()

# Iterate over ingredients
for i in range(len(ingredients)):
    ing1 = ingredients[i]
    shared_counts = []

    # Calculate shared flavor counts with other ingredients
    for j in range(len(ingredients)):
        if i != j:
            ing2 = ingredients[j]
            shared_count = len(set(ing1['flavors']).intersection(ing2['flavors']))
            shared_counts.append((j, shared_count))

    # Sort by shared flavor counts in descending order
    shared_counts.sort(key=lambda x: x[1], reverse=True)

    # Connect ing1 to the top 10 ingredients with the most shared flavors
    for j, _ in shared_counts[:10]:
        ing2 = ingredients[j]
        graph.add_edge(ing1['name'], ing2['name'])
"""
All parts of the graph construction are the same; the only parts that differ are the
condition for having an edge between two nodes and the weight of the edges.
"""
#=========================================================

#=========================================================
# Save the graph using Pickle
with open('graph_most_shared_flavors.pkl', 'wb') as file:
    pickle.dump(graph, file)

# Draw the graph
plt.figure(figsize=(100, 80))  # Adjust the figure size as desired (width, height)

# Use spring layout with fixed seed for consistent layout
pos = nx.spring_layout(graph, seed=42)

# Draw nodes with correct colors based on categories
for ingredient in ingredients:
    category = ingredient['category']
    if isinstance(category, list):
        category = tuple(category)
    nx.draw_networkx_nodes(
        graph,
        pos,
        nodelist=[ingredient['name']],
        node_color=category_colors[category],
        node_size=2000,
        cmap='rainbow'
    )

# Draw edges
nx.draw_networkx_edges(graph, pos)

# Draw labels
nx.draw_networkx_labels(graph, pos, font_size=20)

# Draw category nodes
for category, color in category_colors.items():
    plt.text(0, 0, str(category), color=color, ha='center', fontsize=8)

plt.axis('off')
plt.show()
"""
The same functionality, but she used different code to create the graph, which is
absolutely unnecessary.
"""
#=========================================================




In [None]:
from node2vec import Node2Vec
#=========================================================
# Generate random walks using Node2Vec
node2vec = Node2Vec(graph, dimensions=64, walk_length=30, num_walks=200)
model = node2vec.fit(window=10, min_count=1)

# Retrieve node embeddings for available ingredients
ingredient_embeddings = {}
for ingredient_dict in ingredients:
    ingredient_name = ingredient_dict['name']
    if ingredient_name in model.wv:
        embedding = model.wv[ingredient_name]
        ingredient_embeddings[ingredient_name] = embedding
    else:
        continue
"""
the same code
"""
#=========================================================

In [None]:
#=========================================================
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# Calculate different similarity matrices
cosine_similarity_matrix = cosine_similarity(list(ingredient_embeddings.values()))
euclidean_distance_matrix = euclidean_distances(list(ingredient_embeddings.values()))

# Example: Find the 29 ingredients most similar to "Egg" using different similarity measures
target_ingredient = "Egg"
target_embedding = ingredient_embeddings[target_ingredient]
target_index = list(ingredient_embeddings.keys()).index(target_ingredient)

# Cosine similarity
similar_indices_cosine = cosine_similarity_matrix[target_index].argsort()[::-1][1:30]
similar_ingredients_cosine = [list(ingredient_embeddings.keys())[i] for i in similar_indices_cosine]

# Euclidean distance
similar_indices_euclidean = euclidean_distance_matrix[target_index].argsort()[1:30]
similar_ingredients_euclidean = [list(ingredient_embeddings.keys())[i] for i in similar_indices_euclidean]

print("Ingredients similar to", target_ingredient, "using different similarity measures:")
print("Cosine similarity:")
for ingredient in similar_ingredients_cosine:
    category = next((item['category'] for item in ingredients if item['name'] == ingredient), None)
    if category and category[0] not in ['Meat', 'Seafood','Fish','Fungus']:
        print(f"Ingredient: {ingredient}, Category: {category}")

print("Euclidean distance:")
for ingredient in similar_ingredients_euclidean:
    category = next((item['category'] for item in ingredients if item['name'] == ingredient), None)
    if category and category[0] not in ['Meat', 'Seafood','Fish','Fungus']:
        print(f"Ingredient: {ingredient}, Category: {category}")

"""
the same code
"""
#=========================================================

In [None]:
#=========================================================
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Perform dimensionality reduction on ingredient embeddings
embeddings = np.array(list(ingredient_embeddings.values()))
ingredient_names = list(ingredient_embeddings.keys())

# Use t-SNE for dimensionality reduction
tsne = TSNE(n_components=2)
reduced_embeddings = tsne.fit_transform(embeddings)

# Get unique categories
categories = set()
for ingredient in ingredients:
    categories.add(ingredient['category'][0])

# Assign colors to categories
category_colors = {}
color_index = 0
for category in categories:
    category_colors[category] = f'C{color_index}'
    color_index += 1

# Plot the reduced embeddings without ingredient names and color nodes based on categories
plt.figure(figsize=(10, 10))
for i in range(len(ingredient_names)):
    category = next((ingredient['category'][0] for ingredient in ingredients if ingredient['name'] == ingredient_names[i]), None)
    if category:
        plt.scatter(reduced_embeddings[i, 0], reduced_embeddings[i, 1], color=category_colors[category])

plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")
plt.title("Ingredient Embeddings Visualization")
plt.show()

"""
the same code
"""
#=========================================================

### Author notation
In this modified code, each ingredient is connected to the top 10 ingredients that have the most shared flavors with it. The resulting graph will reflect these connections. The top 10 ingredients with the most shared flavors are connected to the current ingredient in the graph. The weights of the edges are the counts of shared flavors.
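
This notation says that edge weights are the shared-flavor counts, while the quoted cell below calls `graph.add_edge(ing1['name'], ing2['name'])` without a `weight` argument. A weighted variant could look like the following sketch. `weighted_top_k_edges` is a hypothetical name and the data is made up. The returned triples have the shape that networkx's `Graph.add_weighted_edges_from` accepts.

```python
def weighted_top_k_edges(flavors, k=10):
    """Return sorted (a, b, weight) triples; weight is the number of shared flavors."""
    sets = {name: set(values) for name, values in flavors.items()}
    weights = {}
    for name, own in sets.items():
        counts = sorted(
            ((len(own & other), peer) for peer, other in sets.items() if peer != name),
            key=lambda pair: (-pair[0], pair[1]),  # best overlap first, then by name
        )
        for count, peer in counts[:k]:
            if count > 0:  # skip pairs that share nothing
                weights[tuple(sorted((name, peer)))] = count
    return sorted((a, b, w) for (a, b), w in weights.items())


# Made-up flavor lists, only to exercise the function.
toy = {
    "Egg": ["sweet", "fatty", "sulfurous"],
    "Milk": ["sweet", "fatty"],
    "Lemon": ["sour"],
}
triples = weighted_top_k_edges(toy, k=2)
```

Skipping zero-count pairs is a design choice of this sketch; it mirrors the `if shared_count > 0` check that only the last quoted variant applies.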

In [None]:
import json
import networkx as nx
import matplotlib.pyplot as plt

#=========================================================
# Read the JSON file
with open('C:/Users/ghaza/Downloads/integrated_data.json') as file:
    data = json.load(file)

# Extract ingredient names, flavors, and categories
ingredients_data = data
ingredients = []
categories = set()
for ingredient in ingredients_data:
    ingredient_dict = {
        'name': ingredient['ingredients'],
        'flavors': [],
        'category': ingredient['category']
    }
    """
    the same code
    """
#=========================================================

#=========================================================
    for molecule in ingredient['molecules']:
        ingredient_dict['flavors'].extend(molecule['flavor'].split('@'))
    ingredients.append(ingredient_dict)
    categories.add(tuple(ingredient['category']))  # Convert category list to tuple
"""
changing one feature
"""
#=========================================================

#=========================================================
# Create a mapping of categories to colors
category_colors = {}
color_index = 0
for category in categories:
    category_colors[category] = f'C{color_index}'
    color_index += 1
"""
the same code
"""
#=========================================================

#=========================================================
# Create an empty graph
graph = nx.Graph()

# Iterate over ingredients
for i in range(len(ingredients)):
    ing1 = ingredients[i]
    shared_counts = []

    # Calculate shared flavor counts with other ingredients
    for j in range(len(ingredients)):
        if i != j:
            ing2 = ingredients[j]
            shared_count = len(set(ing1['flavors']).intersection(ing2['flavors']))
            shared_counts.append((j, shared_count))

    # Sort by shared flavor counts in descending order
    shared_counts.sort(key=lambda x: x[1], reverse=True)

    # Connect ing1 to the top 10 ingredients with the most shared flavors
    for j, _ in shared_counts[:10]:
        ing2 = ingredients[j]
        graph.add_edge(ing1['name'], ing2['name'])

"""
All parts of the graph construction are the same; the only parts that differ are the
condition for having an edge between two nodes and the weight of the edges.
"""
#=========================================================

#=========================================================
# Draw the graph
plt.figure(figsize=(100, 80))  # Adjust the figure size as desired (width, height)

# Use spring layout with fixed seed for consistent layout
pos = nx.spring_layout(graph, seed=42)

# Draw nodes with correct colors based on categories
node_colors = [category_colors[tuple(ingredient['category'])] for ingredient in ingredients]
nx.draw_networkx_nodes(graph, pos, node_color=node_colors, node_size=2000, cmap='rainbow')

# Draw edges
nx.draw_networkx_edges(graph, pos)

# Draw labels
nx.draw_networkx_labels(graph, pos, font_size=20)

# Draw category nodes
for category, color in category_colors.items():
    plt.text(0, 0, str(category), color=color, ha='center', fontsize=8)

plt.axis('off')
plt.show()
"""
the same code
"""
#=========================================================


In [None]:
from node2vec import Node2Vec
#=========================================================
# Generate random walks using Node2Vec
node2vec = Node2Vec(graph, dimensions=64, walk_length=30, num_walks=200)
model = node2vec.fit(window=10, min_count=1)

# Retrieve node embeddings for available ingredients
ingredient_embeddings = {}
for ingredient_dict in ingredients:
    ingredient_name = ingredient_dict['name']
    if ingredient_name in model.wv:
        embedding = model.wv[ingredient_name]
        ingredient_embeddings[ingredient_name] = embedding
    else:
        continue
"""
the same code
"""
#=========================================================

In [None]:
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
#=========================================================
# Calculate different similarity matrices
cosine_similarity_matrix = cosine_similarity(list(ingredient_embeddings.values()))
euclidean_distance_matrix = euclidean_distances(list(ingredient_embeddings.values()))

# Example: Find the 29 ingredients most similar to "Egg" using different similarity measures
target_ingredient = "Egg"
target_embedding = ingredient_embeddings[target_ingredient]
target_index = list(ingredient_embeddings.keys()).index(target_ingredient)

# Cosine similarity
similar_indices_cosine = cosine_similarity_matrix[target_index].argsort()[::-1][1:30]
similar_ingredients_cosine = [list(ingredient_embeddings.keys())[i] for i in similar_indices_cosine]

# Euclidean distance
similar_indices_euclidean = euclidean_distance_matrix[target_index].argsort()[1:30]
similar_ingredients_euclidean = [list(ingredient_embeddings.keys())[i] for i in similar_indices_euclidean]

print("Ingredients similar to", target_ingredient, "using different similarity measures:")
print("Cosine similarity:")
for ingredient in similar_ingredients_cosine:
    category = next((item['category'] for item in ingredients if item['name'] == ingredient), None)
    if category and category[0] not in ['Meat', 'Seafood','Fish','Fungus']:
        print(f"Ingredient: {ingredient}, Category: {category}")

print("Euclidean distance:")
for ingredient in similar_ingredients_euclidean:
    category = next((item['category'] for item in ingredients if item['name'] == ingredient), None)
    if category and category[0] not in ['Meat', 'Seafood','Fungus','Fish']:
        print(f"Ingredient: {ingredient}, Category: {category}")

"""
the same code
"""
#=========================================================

In [None]:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
#=========================================================
# Perform dimensionality reduction on ingredient embeddings
embeddings = np.array(list(ingredient_embeddings.values()))
ingredient_names = list(ingredient_embeddings.keys())

# Use t-SNE for dimensionality reduction
tsne = TSNE(n_components=2)
reduced_embeddings = tsne.fit_transform(embeddings)

# Get unique categories
categories = set()
for ingredient in ingredients:
    categories.add(ingredient['category'][0])

# Assign colors to categories
category_colors = {}
color_index = 0
for category in categories:
    category_colors[category] = f'C{color_index}'
    color_index += 1

# Plot the reduced embeddings without ingredient names and color nodes based on categories
plt.figure(figsize=(10, 10))
for i in range(len(ingredient_names)):
    category = next((ingredient['category'][0] for ingredient in ingredients if ingredient['name'] == ingredient_names[i]), None)
    if category:
        plt.scatter(reduced_embeddings[i, 0], reduced_embeddings[i, 1], color=category_colors[category])

plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")
plt.title("Ingredient Embeddings Visualization")
plt.show()

"""
the same code
"""
#=========================================================

In my opinion, this is the best approach.

### Author notation
In this modified code, each ingredient is connected to the top 10 ingredients that have the most shared molecules with it. The resulting graph will reflect these connections.

In [None]:
import json
import networkx as nx
import matplotlib.pyplot as plt
#=========================================================
# Read the JSON file
with open('C:/Users/ghaza/Downloads/integrated_data.json') as file:
    data = json.load(file)

# Extract ingredient names, molecules, and categories
ingredients_data = data
ingredients = []
for ingredient in ingredients_data:
    ingredient_dict = {
        'name': ingredient['ingredients'],
        'molecules': [],
        'category': ingredient['category']  # Add the category information
    }
    """
    the same code
    """
#=========================================================

#=========================================================
    for molecule in ingredient['molecules']:
        ingredient_dict['molecules'].append(molecule['molecule'])
    ingredients.append(ingredient_dict)
"""
changing one feature
"""
#=========================================================

#=========================================================
# Create an empty graph
graph = nx.Graph()

# Iterate over ingredients
for i in range(len(ingredients)):
    ing1 = ingredients[i]
    shared_counts = []
    
    # Calculate shared molecule counts with other ingredients
    for j in range(len(ingredients)):
        if i != j:
            ing2 = ingredients[j]
            shared_count = len(set(ing1['molecules']).intersection(ing2['molecules']))
            shared_counts.append((j, shared_count))
    
    # Sort by shared molecule counts in descending order
    shared_counts.sort(key=lambda x: x[1], reverse=True)
    
    # Connect ing1 to the top 10 ingredients with the most shared molecules
    for j, _ in shared_counts[:10]:
        ing2 = ingredients[j]
        graph.add_edge(ing1['name'], ing2['name'])
"""
All parts of the graph construction are the same; the only parts that differ are the
condition for having an edge between two nodes and the weight of the edges.
"""
#=========================================================

#=========================================================
# Create a mapping of categories to colors
category_colors = {}
color_index = 0
for ingredient in ingredients:
    category = ingredient['category']
    if isinstance(category, list):
        category = tuple(category)
    if category not in category_colors:
        category_colors[category] = f'C{color_index}'
        color_index += 1
"""
the same code
"""
#=========================================================

#=========================================================
# Draw the graph
plt.figure(figsize=(100, 80))  # Adjust the figure size as desired (width, height)
pos = nx.spring_layout(graph)  # Positions the nodes using the spring layout algorithm

# Assign node colors based on categories
node_colors = [category_colors[tuple(ingredient['category'])] for ingredient in ingredients]
nx.draw_networkx_nodes(graph, pos, node_color=node_colors, node_size=2000, cmap='rainbow')

# Draw edges
nx.draw_networkx_edges(graph, pos)

# Draw labels
nx.draw_networkx_labels(graph, pos, font_size=20)

plt.axis('off')
plt.show()
"""
the same code
"""
#=========================================================


In [None]:
from node2vec import Node2Vec
#=========================================================
# Generate random walks using Node2Vec
node2vec = Node2Vec(graph, dimensions=64, walk_length=30, num_walks=200)
model = node2vec.fit(window=10, min_count=1)

# Retrieve node embeddings for available ingredients
ingredient_embeddings = {}
for ingredient_dict in ingredients:
    ingredient_name = ingredient_dict['name']
    if ingredient_name in model.wv:
        embedding = model.wv[ingredient_name]
        ingredient_embeddings[ingredient_name] = embedding
    else:
        continue
"""
the same code
"""
#=========================================================

In [None]:
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
#=========================================================
# Calculate different similarity matrices
cosine_similarity_matrix = cosine_similarity(list(ingredient_embeddings.values()))
euclidean_distance_matrix = euclidean_distances(list(ingredient_embeddings.values()))

# Example: Find the 29 ingredients most similar to "Egg" using different similarity measures
target_ingredient = "Egg"
target_embedding = ingredient_embeddings[target_ingredient]
target_index = list(ingredient_embeddings.keys()).index(target_ingredient)

# Cosine similarity
similar_indices_cosine = cosine_similarity_matrix[target_index].argsort()[::-1][1:30]
similar_ingredients_cosine = [list(ingredient_embeddings.keys())[i] for i in similar_indices_cosine]

# Euclidean distance
similar_indices_euclidean = euclidean_distance_matrix[target_index].argsort()[1:30]
similar_ingredients_euclidean = [list(ingredient_embeddings.keys())[i] for i in similar_indices_euclidean]

print("Ingredients similar to", target_ingredient, "using different similarity measures:")
print("Cosine similarity:")
for ingredient in similar_ingredients_cosine:
    category = next((item['category'] for item in ingredients if item['name'] == ingredient), None)
    if category and category[0] not in ['Meat', 'Seafood','Fungus','Fish']:
        print(f"Ingredient: {ingredient}, Category: {category}")

print("Euclidean distance:")
for ingredient in similar_ingredients_euclidean:
    category = next((item['category'] for item in ingredients if item['name'] == ingredient), None)
    if category and category[0] not in ['Meat', 'Seafood','Fish','Fungus']:
        print(f"Ingredient: {ingredient}, Category: {category}")
"""
the same code
"""
#=========================================================

In [None]:

#=========================================================
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Perform dimensionality reduction on ingredient embeddings
embeddings = np.array(list(ingredient_embeddings.values()))
ingredient_names = list(ingredient_embeddings.keys())

# Use t-SNE for dimensionality reduction
tsne = TSNE(n_components=2)
reduced_embeddings = tsne.fit_transform(embeddings)

# Get unique categories
categories = set()
for ingredient in ingredients:
    categories.add(ingredient['category'][0])

# Assign colors to categories
category_colors = {}
color_index = 0
for category in categories:
    category_colors[category] = f'C{color_index}'
    color_index += 1

# Plot the reduced embeddings without ingredient names and color nodes based on categories
plt.figure(figsize=(10, 10))
for i in range(len(ingredient_names)):
    category = next((ingredient['category'][0] for ingredient in ingredients if ingredient['name'] == ingredient_names[i]), None)
    if category:
        plt.scatter(reduced_embeddings[i, 0], reduced_embeddings[i, 1], color=category_colors[category])

plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")
plt.title("Ingredient Embeddings Visualization")
plt.show()
"""
the same code
"""
#=========================================================

### Author notation
In this modified code, each ingredient is connected to the top 10 ingredients that have the most shared molecules with it. The resulting graph reflects these connections, and the count of shared molecules is used as the weight of the edges.

In [None]:
import json
import networkx as nx
import matplotlib.pyplot as plt
#=========================================================
# Read the JSON file
with open('C:/Users/ghaza/Downloads/integrated_data.json') as file:
    data = json.load(file)

# Extract ingredient names, molecules, and categories
ingredients_data = data
ingredients = []
for ingredient in ingredients_data:
    ingredient_dict = {
        'name': ingredient['ingredients'],
        'molecules': [],
        'category': ingredient['category']  # Add the category information
    }
    """
    the same code
    """
#=========================================================

#=========================================================
    for molecule in ingredient['molecules']:
        ingredient_dict['molecules'].append(molecule['molecule'])
    ingredients.append(ingredient_dict)
"""
changing the feature
"""
#=========================================================

#=========================================================
# Create an empty graph
graph = nx.Graph()

# Iterate over ingredients
for i in range(len(ingredients)):
    ing1 = ingredients[i]
    shared_counts = []
    
    # Calculate shared molecule counts with other ingredients
    for j in range(len(ingredients)):
        if i != j:
            ing2 = ingredients[j]
            shared_count = len(set(ing1['molecules']).intersection(ing2['molecules']))
            shared_counts.append((j, shared_count))
    
    # Sort by shared molecule counts in descending order
    shared_counts.sort(key=lambda x: x[1], reverse=True)
    
    # Connect ing1 to the top 10 ingredients with the most shared molecules
    for j, shared_count in shared_counts[:10]:
        ing2 = ingredients[j]
        if shared_count > 0:
            graph.add_edge(ing1['name'], ing2['name'], weight=shared_count)
"""
All parts of the graph construction are the same; the only differences are the
condition for adding an edge between two nodes and the weight of the edges
"""
#=========================================================

#=========================================================
# Create a mapping of categories to colors
category_colors = {}
color_index = 0
for ingredient in ingredients:
    category = ingredient['category']
    if isinstance(category, list):
        category = tuple(category)
    if category not in category_colors:
        category_colors[category] = f'C{color_index}'
        color_index += 1
"""
the same code
"""
#=========================================================

#=========================================================
plt.figure(figsize=(100, 80))  # Adjust the figure size as desired (width, height)
pos = nx.spring_layout(graph)
weights = nx.get_edge_attributes(graph, 'weight')

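# Assign node colors based on categories, following the graph's own node order
# (draw_networkx matches node_color entries to graph.nodes, not to the ingredients list)
category_by_name = {ingredient['name']: tuple(ingredient['category']) for ingredient in ingredients}
node_colors = [category_colors[category_by_name[node]] for node in graph.nodes]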

# Draw the graph with colored nodes
nx.draw_networkx(graph, pos, with_labels=True, node_color=node_colors, node_size=100, font_size=20, cmap=plt.cm.rainbow)

# Draw edge labels
nx.draw_networkx_edge_labels(graph, pos, edge_labels=weights, font_size=10)

plt.show()
"""
the same code, with some slight differences
"""
#=========================================================

In [None]:
from node2vec import Node2Vec
#=========================================================
# Generate random walks using Node2Vec
node2vec = Node2Vec(graph, dimensions=64, walk_length=30, num_walks=200)
model = node2vec.fit(window=10, min_count=1)

# Retrieve node embeddings for available ingredients
ingredient_embeddings = {}
for ingredient_dict in ingredients:
    ingredient_name = ingredient_dict['name']
    try:
        embedding = model.wv[ingredient_name]
        ingredient_embeddings[ingredient_name] = embedding
    except KeyError:
        continue
"""
the same code; I believe she accidentally added a try/except block here!
The structure is noticeably better than the previous one, though.
"""
#=========================================================
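
The try/except in the cell above skips ingredients that have no embedding, which can happen when an ingredient never received an edge and so is missing from the graph. A membership test expresses the same intent without exception handling. This is a minimal sketch: the helper name `collect_embeddings` is hypothetical, and it assumes `vectors` supports `in` and `[]` lookups, as gensim's `KeyedVectors` (the `model.wv` object) does. A plain dictionary stands in for it here so the example runs on its own.

```python
def collect_embeddings(ingredients, vectors):
    """Return {name: vector} for every ingredient that has an embedding."""
    return {
        ing['name']: vectors[ing['name']]
        for ing in ingredients
        if ing['name'] in vectors  # skip nodes that never entered the graph
    }


# Hypothetical stand-ins for the notebook's `ingredients` list and `model.wv`
sample_ingredients = [{'name': 'Egg'}, {'name': 'Milk'}, {'name': 'Salt'}]
sample_vectors = {'Egg': [0.1, 0.2], 'Milk': [0.3, 0.4]}

print(collect_embeddings(sample_ingredients, sample_vectors))
```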

In [None]:
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
#=========================================================
# Calculate different similarity matrices
cosine_similarity_matrix = cosine_similarity(list(ingredient_embeddings.values()))
euclidean_distance_matrix = euclidean_distances(list(ingredient_embeddings.values()))

# Example: Find the 29 ingredients most similar to "Egg" (slice [1:30]) using different similarity measures
target_ingredient = "Egg"
target_embedding = ingredient_embeddings[target_ingredient]
target_index = list(ingredient_embeddings.keys()).index(target_ingredient)

# Cosine similarity
similar_indices_cosine = cosine_similarity_matrix[target_index].argsort()[::-1][1:30]
similar_ingredients_cosine = [list(ingredient_embeddings.keys())[i] for i in similar_indices_cosine]

# Euclidean distance
similar_indices_euclidean = euclidean_distance_matrix[target_index].argsort()[1:30]
similar_ingredients_euclidean = [list(ingredient_embeddings.keys())[i] for i in similar_indices_euclidean]

print("Ingredients similar to", target_ingredient, "using different similarity measures:")
print("Cosine similarity:")
for ingredient in similar_ingredients_cosine:
    category = next((item['category'] for item in ingredients if item['name'] == ingredient), None)
    if category and category[0] not in ['Meat', 'Seafood','Fish','Fungus']:
        print(f"Ingredient: {ingredient}, Category: {category}")

print("Euclidean distance:")
for ingredient in similar_ingredients_euclidean:
    category = next((item['category'] for item in ingredients if item['name'] == ingredient), None)
    if category and category[0] not in ['Meat','Fish', 'Seafood','Fungus']:
        print(f"Ingredient: {ingredient}, Category: {category}")
"""
the same code
"""
#=========================================================

In [None]:
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
#=========================================================
# Calculate different similarity matrices
cosine_similarity_matrix = cosine_similarity(list(ingredient_embeddings.values()))
euclidean_distance_matrix = euclidean_distances(list(ingredient_embeddings.values()))

# Example: Find the 29 ingredients most similar to "Egg" (slice [1:30]) using different similarity measures
target_ingredient = "Egg"
target_embedding = ingredient_embeddings[target_ingredient]
target_index = list(ingredient_embeddings.keys()).index(target_ingredient)

# Cosine similarity
similar_indices_cosine = cosine_similarity_matrix[target_index].argsort()[::-1][1:30]
similar_ingredients_cosine = [list(ingredient_embeddings.keys())[i] for i in similar_indices_cosine]
similar_ingredients_found_cosine = []

# Euclidean distance
similar_indices_euclidean = euclidean_distance_matrix[target_index].argsort()[1:30]
similar_ingredients_euclidean = [list(ingredient_embeddings.keys())[i] for i in similar_indices_euclidean]
similar_ingredients_found_euclidean = []

print("Ingredients similar to", target_ingredient, "using different similarity measures:")
print("Cosine similarity:")
for ingredient in similar_ingredients_cosine:
    category = next((item['category'] for item in ingredients if item['name'] == ingredient), None)
    if category and category[0] not in ['Meat', 'Seafood', 'Fish', 'Fungus']:
        print(f"Ingredient: {ingredient}, Category: {category}")
        similar_ingredients_found_cosine.append(ingredient)

print("Euclidean distance:")
for ingredient in similar_ingredients_euclidean:
    category = next((item['category'] for item in ingredients if item['name'] == ingredient), None)
    if category and category[0] not in ['Meat', 'Fish', 'Seafood', 'Fungus']:
        print(f"Ingredient: {ingredient}, Category: {category}")
        similar_ingredients_found_euclidean.append(ingredient)

print("Ingredients similar to", target_ingredient, "found using cosine similarity:")
print(similar_ingredients_found_cosine)

print("Ingredients similar to", target_ingredient, "found using euclidean distance:")
print(similar_ingredients_found_euclidean)
"""
the same code
"""
#=========================================================
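
The two similarity cells above differ only in whether the matches are stored, and the cosine and Euclidean loops inside each cell are near copies of each other. One function with a `metric` argument could replace all four loops. The sketch below is an illustration, not the author's method: the names `most_similar`, `categories`, `top_n`, and `excluded` are hypothetical, and it uses only the standard library rather than scikit-learn so that it runs by itself.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; smaller means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - (dot / norm if norm else 0.0)


def most_similar(target, embeddings, categories, metric='cosine',
                 top_n=5, excluded=()):
    """Return the top_n ingredient names closest to `target`.

    `embeddings` maps name -> vector, `categories` maps name -> first category.
    Ingredients whose category is in `excluded` are filtered out.
    """
    distance = cosine_distance if metric == 'cosine' else math.dist
    target_vec = embeddings[target]
    scored = [
        (distance(target_vec, vec), name)
        for name, vec in embeddings.items()
        if name != target and categories.get(name) not in excluded
    ]
    scored.sort()  # ascending distance, so the closest come first
    return [name for _, name in scored[:top_n]]


# Hypothetical toy data, for illustration only
toy_embeddings = {
    'Egg': [1.0, 0.0],
    'Milk': [0.9, 0.1],
    'Beef': [1.0, 0.05],
    'Salt': [0.0, 1.0],
}
toy_categories = {'Egg': 'Animal', 'Milk': 'Dairy', 'Beef': 'Meat', 'Salt': 'Mineral'}

print(most_similar('Egg', toy_embeddings, toy_categories, top_n=2, excluded=('Meat',)))
```

Filtering before slicing means `top_n` counts only the ingredients that survive the category filter, whereas the notebook slices first and filters afterwards.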

In [None]:
#=========================================================

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Perform dimensionality reduction on ingredient embeddings
embeddings = np.array(list(ingredient_embeddings.values()))
ingredient_names = list(ingredient_embeddings.keys())

# Use t-SNE for dimensionality reduction
tsne = TSNE(n_components=2)
reduced_embeddings = tsne.fit_transform(embeddings)

# Get unique categories
categories = set()
for ingredient in ingredients:
    categories.add(ingredient['category'][0])

# Assign colors to categories
category_colors = {}
color_index = 0
for category in categories:
    category_colors[category] = f'C{color_index}'
    color_index += 1

# Plot the reduced embeddings without ingredient names and color nodes based on categories
plt.figure(figsize=(10, 10))
for i in range(len(ingredient_names)):
    category = next((ingredient['category'][0] for ingredient in ingredients if ingredient['name'] == ingredient_names[i]), None)
    if category:
        plt.scatter(reduced_embeddings[i, 0], reduced_embeddings[i, 1], color=category_colors[category])

plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")
plt.title("Ingredient Embeddings Visualization")
plt.show()
"""
the same code
"""
#=========================================================
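
In the plotting cell above, the category of every point is found with a `next(...)` scan over the whole ingredient list, and the colors are assigned by iterating over a set, whose order for strings can change between Python runs. A lookup dictionary and a sorted category list avoid both issues. This is a sketch with hypothetical helper names (`category_color_map`, `point_colors`) and made-up sample data.

```python
def category_color_map(ingredients):
    """Map each first-level category to a matplotlib 'CN' color string."""
    # Sorting makes the category -> color assignment identical on every run
    categories = sorted({ing['category'][0] for ing in ingredients})
    return {category: f'C{i}' for i, category in enumerate(categories)}


def point_colors(names, ingredients, color_map):
    """Return one color per embedded ingredient, in the order of `names`."""
    # Build the name -> category lookup once instead of scanning per point
    first_category = {ing['name']: ing['category'][0] for ing in ingredients}
    return [color_map[first_category[name]] for name in names]


# Hypothetical sample data, for illustration only
sample_ingredients = [
    {'name': 'Egg', 'category': ['Animal']},
    {'name': 'Milk', 'category': ['Dairy']},
    {'name': 'Yogurt', 'category': ['Dairy']},
]
color_map = category_color_map(sample_ingredients)
print(point_colors(['Yogurt', 'Egg'], sample_ingredients, color_map))
```

The resulting list could be passed to a single `plt.scatter(x, y, c=colors)` call instead of one call per point.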

### Author notation
To create an AI-powered tool that recommends egg replacements in bakery recipes, you can follow these steps:

Data Collection: Gather a dataset of bakery recipes along with their ingredients and instructions. You can obtain recipe data from various online sources or use existing recipe datasets available.

Preprocess the Data: Clean and preprocess the recipe data to remove any irrelevant information and standardize the ingredient names. You can use techniques like tokenization, lowercasing, and removing punctuation to prepare the data for further processing.

Build Ingredient Embeddings: Train or load a pre-trained word embedding model (such as Word2Vec or GloVe) using a large corpus of text data. Map each ingredient to its corresponding word embedding vector. This step captures the semantic relationships between ingredients.

Implement Egg Replacement Logic: Write a function or algorithm that takes a recipe as input and identifies the presence of eggs. If eggs are found, use the ingredient embeddings to recommend suitable replacements based on similarity measures like cosine similarity or Euclidean distance. The closest ingredient embeddings can be considered as potential replacements.

Deploy the AI Tool: Create a user interface or API to interact with the AI-powered tool. Users can input a bakery recipe, and the tool will provide recommendations for egg replacements based on the implemented logic. The tool can display the recommended replacements and their corresponding similarity scores.

Here's a code snippet to give you an idea of how to implement the egg replacement logic:

from sklearn.metrics.pairwise import cosine_distances
import numpy as np

def recommend_egg_replacement(recipe, ingredient_embeddings, threshold=0.7):
    # Check if the recipe contains eggs
    if 'egg' not in recipe.lower():
        return "No egg replacements needed."

    # Get the embedding for the 'egg' ingredient
    egg_embedding = ingredient_embeddings['egg']

    # Calculate similarity distances between the 'egg' embedding and other ingredients
    distances = cosine_distances([egg_embedding], list(ingredient_embeddings.values()))

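    # Keep ingredients whose cosine distance to 'egg' is below the threshold
    # (smaller distance means more similar), skipping 'egg' itself (distance 0)
    names = list(ingredient_embeddings.keys())
    replacements = []
    for i, distance in enumerate(distances[0]):
        if names[i] != 'egg' and distance < threshold:
            replacements.append(names[i])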

    if len(replacements) == 0:
        return "No suitable replacements found."
    else:
        return replacements

# Example usage
recipe = "Classic chocolate chip cookies with eggs and butter"
replacements = recommend_egg_replacement(recipe, ingredient_embeddings, threshold=0.7)
print("Recommended egg replacements:", replacements)
Please note that the above code assumes you have already obtained the ingredient embeddings and stored them in the ingredient_embeddings dictionary. The threshold parameter is used to control the similarity threshold for considering ingredient replacements.

Remember to adapt the code to your specific dataset and requirements, and ensure that you have the necessary libraries installed (e.g., scikit-learn for cosine_distances).

### General Points
In this part, the author tried to generate a response with ChatGPT. She made a list of ingredients that can be used instead of egg, passed them as input to ChatGPT, and then extracted the answer from the bot's response. The last piece of the program is the most complete one. I will write some comments on that part and erase the rest to avoid confusion.

In [7]:
import json
import requests
import time
from colorama import Fore, Style
#=========================================================
API_KEY = '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'
url = 'https://api.openai.com/v1/chat/completions'
header = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {API_KEY}'
}

ingredients = ["Apple sauce", "Peanut butter", "Buttermilk", "yogurt", "Coconut", "Soybeans","kidney beans","Ground Flaxseeds", "cauliflower", "beans", "Chickpeas","Milk","lima beans"]
for ingredient in ingredients:
    data = {
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": f"functionality of {ingredient} as an egg replacement in bakery!"}],
        "temperature": 0.7
    }

    response = requests.post(url, headers=header, data=json.dumps(data))
    response_json = response.json()

    # Extracting the generated answer from the API response
    answer = response_json['choices'][0]['message']['content']

    # Print the answer with ingredient name in color
    colored_functionality = Fore.GREEN + f"Functionality of {ingredient} as an egg replacement in bakery:" + Style.RESET_ALL
    print(colored_functionality)
    print(answer)

    # Wait for 30 seconds before making another API request
    time.sleep(30)

    """
    This piece of code can be transferred into a function; the function can be named
    'testing_pipeline'. The ingredients, API key, and URL can be passed to the function as
    inputs, and the answer can be the output of the function.
    """
#=========================================================
"""
In general, it can be quite practical to have a 'main()' function that runs all the mentioned
functions inside it. Then, this function can be run through
if __name__ == '__main__':
    main(input values)
"""

Functionality of Apple sauce as an egg replacement in bakery:
Apple sauce is a commonly used egg replacement in vegan baking as it can help bind ingredients together and add moisture to the recipe. Here are some ways in which apple sauce can be used in place of eggs:

1. Binding: Applesauce can help bind ingredients together just like eggs. It works well in recipes that require one or two eggs. About 1/4 cup of applesauce can replace one egg in a recipe.

2. Moisture: Applesauce can add moisture to the recipe, which is crucial in baked goods. Moisture helps to keep the baked goods moist and tender. This is especially important in recipes that require a lot of dry ingredients like flour. 

3. Flavor: Applesauce adds a slightly sweet and fruity flavor to baked goods. This can work well in recipes like muffins, cakes, and quick breads. 

It is important to note that using applesauce may not work in all recipes. It is best to experiment with the amount of applesauce used in the re

As I mentioned multiple times during the code review, this code has a vital need for structure. One structure that can be used for this part of the pipeline is as follows. Many other implementations are possible, but I found the following more organized and maintainable.

In [None]:
# example structure
class IntegratedData():
    def __init__(self):
        pass

    def making_file_paths_list(self):
        pass

    def making_integrated_data(self):
        pass

    def saving_file(self):
        pass

    @staticmethod
    def config():
        pass

class Graph():
    def __init__(self):
        pass

    def making_graph(self):
        pass

    def drawing_graph(self):
        pass

class Embedding():
    def __init__(self):
        pass

    def making_embedding_models(self):
        pass

    def making_embedding_dict(self):
        pass

class SimilarityMetrics(Embedding):
    def __init__(self):
        pass

    def most_similar_euclidean(self):
        pass

    def most_similar_cosine(self):
        pass

class DimensionalityReduction():
    def __init__(self):
        pass

    def t_sne(self):
        pass

    def drawing_reduced_dimensions(self):
        pass

class TestIngredients():
    def __init__(self):
        pass

    def egg(self):
        pass

if __name__ == '__main__':
    pass