# **Hummus - Community Based Recommendations**
Notebook for the first project for the Machine Learning Complements course (CAC).

## **Introduction**

The goal of this project is to develop a recommendation system for the Hummus dataset. The dataset contains information about users and their ratings for different recipes. The system should be able to recommend items to users based on their reviews.

### Imports

The following libraries will be used in this project:

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, girvan_newman, label_propagation_communities, fast_label_propagation_communities
from surprise.model_selection import train_test_split
from surprise import Dataset, Reader, KNNBasic, NormalPredictor, SVD, accuracy
import utils as ut
import warnings
warnings.simplefilter(action='ignore')

### Constants

In this section of the notebook, we define all the constants that will be used throughout the code. Constants are values that do not change and remain the same every time the code is run. Defining constants at the beginning of the code makes it easier to manage and modify them if needed.

In [None]:
VERBOSE = True # Logging
SAMPLES = 10000 # Number of review rows to read from the full dataset, only applies if USE_SAMPLES = False
USE_SAMPLES = True # If True, load the previously saved sampled CSVs; if False, sample from the full dataset
USE_STORED_GRAPH = False # If False, generate new graph, otherwise load from file

### Load Data

Our dataset was collected from the food.com website. It contains information about users and their ratings/reviews for different recipes. The dataset is stored in 3 CSV files, which we will load into pandas DataFrames.

In [None]:
if USE_SAMPLES:
    df_members = pd.read_csv('pp_members_sampled.csv')
    df_recipes = pd.read_csv('pp_recipes_sampled.csv')
    df_reviews = pd.read_csv('pp_reviews_sampled.csv')
else:
    df_members = pd.read_csv('pp_members.csv')#, nrows=SAMPLES)
    df_recipes = pd.read_csv('pp_recipes.csv')#, nrows=SAMPLES)
    df_reviews = pd.read_csv('pp_reviews.csv', nrows=SAMPLES)


    df_members = df_members[df_members['member_id'].isin(df_reviews['member_id'])] # keep only members who have reviewed
    df_recipes = df_recipes[df_recipes['recipe_id'].isin(df_reviews['recipe_id'])] # keep only recipes that have been reviewed
    
    # Save the sampled data
    df_members.to_csv('pp_members_sampled.csv', index=False)
    df_recipes.to_csv('pp_recipes_sampled.csv', index=False)
    df_reviews.to_csv('pp_reviews_sampled.csv', index=False)

# Initial Observations

The dataset contains 3 files: `pp_recipes.csv`, `pp_members.csv`, and `pp_reviews.csv`. The recipes file contains information about the recipes, such as the recipe title, the recipe id, and the recipe ingredients. The members file contains information about the users, such as the member id and the member name. The reviews file contains information about the reviews, such as the member id, the recipe id, and the rating given by the member to the recipe.

In this section we will take a look at the first few rows of each file to get a better understanding of the data, and do some initial data exploration.

#### Initial Observation - Members dataset

In [None]:
ut.initial_obs(df_members)

In [None]:
df_members.describe()

#### Initial Observation - Recipes dataset

In [None]:
ut.initial_obs(df_recipes)

#### Initial Observation - Reviews dataset

In [None]:
ut.initial_obs(df_reviews)

#### Reviews distribution across rating

More than 95% of the reviews are positive, with a rating of 4 or 5. This is important to keep in mind when building the recommendation system: with so few negative ratings to learn from, a model may be biased towards predicting high ratings and recommending popular recipes.

In [None]:
ut.plot_reviews_rating(df_reviews)

#### Distribution of number of reviews per user

Most users have written only one review, which is expected.

In [None]:
def plot_num_users_num_reviews(df):
    reviews_count = df.groupby('member_id')['review_id'].count()

    # Define bin edges for the histogram (a user with n reviews falls in the bin where lower < n <= upper)
    bins = [0, 5, 10, 15, reviews_count.max()+1]
    labels = ['1-5', '6-10', '11-15', '16-'+str(reviews_count.max())]  # Labels for bins

    # Calculate frequencies for each range
    frequencies = [((reviews_count > bins[i]) & (reviews_count <= bins[i+1])).sum() for i in range(len(bins)-1)]

    # Plot the pre-calculated frequencies once and label each bar with its count
    bars = plt.bar(labels, frequencies, color='skyblue', edgecolor='black')
    plt.bar_label(bars)

    plt.title('Distribution of Number of Reviews per User')
    plt.xlabel('Number of Reviews')
    plt.ylabel('Number of Members')
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.show()

plot_num_users_num_reviews(df_reviews)

#### Top 10 most popular reviewed recipes

Top 10 Recipes with more than 20 reviews.

In [None]:
# The average rating and number of ratings are already columns of df_recipes,
# so we only need to filter recipes with more than 20 reviews
filtered_recipes = df_recipes[df_recipes['number_of_ratings'] > 20]

# Sort recipes based on average rating
top_rated_recipes = filtered_recipes.sort_values(by='average_rating', ascending=False).head(10)

# Print the name and rating of the top-rated recipes as well as the number of reviews
print('Top-Rated Recipes:')
print('------------------')

for index, recipe in top_rated_recipes.iterrows():
    print(f"{recipe['title']} (Recipe ID: {recipe['recipe_id']}) - Average Rating: {recipe['average_rating']:.2f} ({recipe['number_of_ratings']} reviews)")

#### Finishing the Initial Observations

In this section we have loaded the data, explored the first few rows of each file, and done some initial data exploration. We have also identified some key points that will be important to keep in mind when building the recommendation system.

- More than 90% of our reviews are positive, with a rating of 4 or 5.
- Most users have only reviewed one recipe.
- The dataset was collected and preprocessed before being provided to us, so there is no need for further preprocessing at this stage.

#### Initial Preparation - Create the graph for network analysis

We will start by doing a Social Network Analysis on the dataset. This will help us understand the relationships present in our dataset and find patterns that can be used to make recommendations.

We start by creating a graph with the members, our main focus, as nodes and the shared reviews as edges. The weight of an edge grows with the number of reviews the two members have in common, that is, reviews of the same recipe with the same sentiment (both positive or both negative). This will allow us to use network analysis to find communities of members with similar tastes.

First we will group the reviews by recipe and by sentiment (rating above 3 or not), so we can extract the members that are connected.

In [None]:
# Group reviews by recipe and evaluation (>3, <=3)
grouped_reviews = df_reviews.groupby(['recipe_id', df_reviews['rating'] > 3])

# Create a dictionary to store relations between users
user_relations = {}

# Iterate through each group
for (recipe_id, is_positive_rating), group in grouped_reviews:
    # Extract user IDs for this recipe and evaluation
    if VERBOSE: print(recipe_id, is_positive_rating, group['member_id'].unique())
    user_ids = group['member_id'].unique()
    user_ids.sort()
    
    # Update relations between users for this recipe
    for i, user_id1 in enumerate(user_ids):
        for user_id2 in user_ids[i+1:]:
            # Check if there's an entry for this relation between users
            if (user_id1, user_id2) not in user_relations:
                if VERBOSE: print(f"Creating new relation between {user_id1} and {user_id2}")
                user_relations[(user_id1, user_id2)] = 0
            
            # Increment the relation count between the users based on the evaluation
            user_relations[(user_id1, user_id2)] += 2 if is_positive_rating else 1
            if VERBOSE: print(f"Relation between {user_id1} and {user_id2} has been incremented to {user_relations[(user_id1, user_id2)]}")

# Now user_relations contains relations between users
if VERBOSE: print("Size of user_relations:", len(user_relations))

Users with the same taste will have a high relation score, and users with different tastes will have a low one or no relation at all.

We decided to give more weight to the positive reviews, as they say more about the taste of the users: a shared positive review adds 2 to the relation, while a shared negative review adds 1.
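
To make the weighting scheme concrete, here is a minimal, self-contained sketch on a handful of made-up reviews. The function name `build_relations` and the toy data are assumptions for illustration only; the logic mirrors the cell above (group by recipe and sentiment, then add 2 for a shared positive review and 1 for a shared negative one).

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy reviews: (member_id, recipe_id, rating)
toy_reviews = [
    (1, "r1", 5), (2, "r1", 4), (3, "r1", 2),
    (1, "r2", 1), (2, "r2", 5), (3, "r2", 2),
    (1, "r3", 5), (2, "r3", 5),
]

def build_relations(reviews, threshold=3):
    """Score member pairs: +2 per recipe both rated above `threshold`, +1 per recipe both rated at or below it."""
    # Group members by (recipe, sentiment), like the groupby in the notebook
    groups = {}
    for member_id, recipe_id, rating in reviews:
        groups.setdefault((recipe_id, rating > threshold), set()).add(member_id)

    relations = Counter()
    for (_, is_positive), members in groups.items():
        # Sorting keeps each pair in a canonical (smaller id, larger id) order
        for pair in combinations(sorted(members), 2):
            relations[pair] += 2 if is_positive else 1
    return relations

relations = build_relations(toy_reviews)
print(dict(relations))
```

Members 1 and 2 both liked recipes `r1` and `r3`, so their relation is 4. Members 1 and 3 both disliked `r2`, so their relation is 1. Members 2 and 3 never agree, so no edge is created between them.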

Here are the strongest relations:

In [None]:
sorted_dict = sorted(user_relations.items(), key=lambda item: item[1], reverse=True)

# Convert the sorted_dict to a DataFrame and split the tuple into two columns
sorted_df = pd.DataFrame(sorted_dict, columns=['member_ids', 'relationship_strength'])
sorted_df[['member_id1', 'member_id2']] = pd.DataFrame(sorted_df['member_ids'].tolist(), index=sorted_df.index)
sorted_df = sorted_df.drop(columns='member_ids')

# Merge with the members DataFrame to get the names for member_id1
merged_df = pd.merge(sorted_df, df_members[['member_id', 'member_name']], left_on='member_id1', right_on='member_id')

# Rename the member_name column to member_name1
merged_df = merged_df.rename(columns={'member_name': 'member_name1'})

# Merge again with the members DataFrame to get the names for member_id2
merged_df = pd.merge(merged_df, df_members[['member_id', 'member_name']], left_on='member_id2', right_on='member_id')

# Rename the member_name column to member_name2
merged_df = merged_df.rename(columns={'member_name': 'member_name2'})

# Print the 10 pairs with the strongest relationships along with their names and IDs
for index, row in merged_df[:10].iterrows():
    print((row['member_id1'], row['member_name1']), (row['member_id2'], row['member_name2']), ":", row['relationship_strength'])

Creating the graph...

In [None]:
g = nx.Graph()
vertex_indices = {}

# Check if the file exists
if USE_STORED_GRAPH and os.path.exists('graph_file.graphml'):
    # Load the graph from file
    if VERBOSE: print("Loading graph from file")
    g = nx.read_graphml('graph_file.graphml')
else:
    if VERBOSE: print("Creating new graph")
    for (u,v), weight in user_relations.items():
        g.add_edge(u, v, weight = weight)
    nx.write_graphml(g, "graph_file.graphml")
    
if VERBOSE: print(g)

connected_components = list(nx.connected_components(g))
largest_connected_component = max(connected_components, key=len)
g = g.subgraph(largest_connected_component)

# Use the spring layout
pos = nx.spring_layout(g)

# Draw the graph without labels
nx.draw(g, pos, with_labels=False, node_size=10)

# Show the plot
plt.show()

## Social Network Analysis

Before we start using models, we will do some research on our newly created network graph.

Some data about the graph:

In [None]:
print("Number of members:", g.number_of_nodes())
print("Number of connections:", g.number_of_edges())
print("Average degree:", sum(dict(g.degree()).values()) / g.number_of_nodes())
print("Graph density:", nx.density(g))

The graph itself is very sparse as the density is very low.
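
As a quick reminder of what these two numbers mean, the sketch below computes the average degree and the density of a tiny undirected graph by hand. The helper `graph_summary` and the toy edge list are illustrative assumptions. For a simple undirected graph with `n` nodes and `m` edges, the average degree is `2m / n` and the density is `2m / (n (n - 1))`, the fraction of all possible edges that actually exist.

```python
def graph_summary(num_nodes, edges):
    """Return (average degree, density) of a simple undirected graph."""
    num_edges = len(edges)
    # Every edge contributes to the degree of two nodes
    average_degree = 2 * num_edges / num_nodes
    # Density: existing edges divided by the n * (n - 1) / 2 possible ones
    density = 2 * num_edges / (num_nodes * (num_nodes - 1))
    return average_degree, density

# Hypothetical toy graph: a path 0 - 1 - 2 - 3 - 4
toy_path_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
avg_degree, density = graph_summary(5, toy_path_edges)
print(avg_degree, density)
```

A density close to 0 means that most pairs of members share no review at all, which is what we observe in our graph.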

### Power Law Distribution

Here, we will investigate whether our network adheres to a power law distribution, which signifies a characteristic pattern in which a few nodes possess an exceptionally high number of connections, while the majority have only a few connections.

In [None]:
degree_sequence = sorted([d for n, d in g.degree()], reverse=True)
degree_count = np.unique(degree_sequence, return_counts=True)

# Plot degree distribution
plt.figure(figsize=(10, 6))
plt.scatter(degree_count[0], degree_count[1], marker='o', color='b', alpha=0.5)
plt.xscale('log')
plt.yscale('log')
plt.title("Degree Distribution")
plt.xlabel("Degree")
plt.ylabel("Number of Users")
plt.grid(True, which="both", ls="--")
plt.show()

As we can see, the degree distribution is roughly linear on the log-log scale, so our network appears to follow a power law distribution. However, there are some outliers, which might appear because we are only looking at a portion of the dataset.
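
One rough way to put a number on this visual impression is to fit a straight line to the log-log points: if the number of users with degree `k` behaves like `k^(-gamma)`, the slope of that line is `-gamma`. The sketch below does this with a plain least-squares fit on synthetic degrees. The function name `fit_power_law_exponent` and the data are assumptions for illustration, and a log-log fit is only a quick diagnostic, not a rigorous statistical test.

```python
import math
from collections import Counter

def fit_power_law_exponent(degrees):
    """Estimate gamma in count(k) ~ k^(-gamma) from the least-squares slope of log(count) against log(degree)."""
    counts = Counter(d for d in degrees if d > 0)
    xs = [math.log(k) for k in counts]
    ys = [math.log(c) for c in counts.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope

# Synthetic degree sequence with exactly 10000 / k^2 nodes of degree k
toy_degrees = []
for k in (1, 2, 4, 5, 10):
    toy_degrees += [k] * (10000 // k ** 2)

gamma = fit_power_law_exponent(toy_degrees)
print(f"Estimated exponent: {gamma:.2f}")
```

In the notebook, the same function could be applied to `degree_sequence`; outliers in the tail would pull the estimate away from the true exponent.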

### Small Worlds Theory

The small worlds theory posits that within complex networks, such as social networks or neural connections, most nodes can be reached from any other node via a small number of steps, highlighting the prevalence of short path lengths and high clustering coefficients.

In [None]:
eccentricities = nx.eccentricity(g)

#avg_shortest_path_length = nx.average_shortest_path_length(g)
#print(avg_shortest_path_length)

According to our results, it takes approximately 3 steps, a small number, to reach one node from any other.
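
The average shortest path length behind this kind of statement can be computed with a breadth-first search from every node. The sketch below does so for a small unweighted graph stored as an adjacency dictionary. The name `mean_path_length` and the toy star graph are assumptions for illustration, and the function assumes the graph is connected.

```python
from collections import deque

def mean_path_length(adjacency):
    """Mean BFS distance over all ordered pairs of distinct nodes in a connected, unweighted graph."""
    total, pairs = 0, 0
    for source in adjacency:
        # Standard breadth-first search from `source`
        distances = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in distances:
                    distances[neighbor] = distances[node] + 1
                    queue.append(neighbor)
        total += sum(distances.values())
        pairs += len(distances) - 1  # exclude the source itself
    return total / pairs

# Hypothetical toy graph: a star with center 0 and leaves 1..4
toy_star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
avg_length = mean_path_length(toy_star)
print(avg_length)
```

In the star, the center is one step from every leaf and any two leaves are two steps apart, so the average stays below 2 however many leaves are added. A few highly connected members play the same shortcut role in our network.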

### Graph Diameter

The diameter of a graph is the length of the shortest path between its two most distant nodes, which equals the largest eccentricity in the graph.

In [None]:
diameter = max(eccentricities.values())
print(f"Diameter of the graph: {diameter}")

In [None]:
# Nodes whose eccentricity equals the diameter form the periphery of the graph
peripheral_nodes = [node for node, eccentricity in eccentricities.items() if eccentricity == diameter]
source_node = peripheral_nodes[0]

# The node farthest from a peripheral node is exactly `diameter` steps away from it
distances = nx.single_source_shortest_path_length(g, source_node)
target_node = max(distances, key=distances.get)
diameter_path = nx.shortest_path(g, source=source_node, target=target_node)

# Edges along the diameter path, also stored as unordered pairs for lookups
diameter_edges = [(diameter_path[i], diameter_path[i + 1]) for i in range(len(diameter_path) - 1)]
diameter_edge_set = {frozenset(edge) for edge in diameter_edges}
diameter_node_set = set(diameter_path)

# Draw the graph
pos = nx.spring_layout(g)  # Positions for all nodes

# Draw nodes not in the diameter path
non_diameter_nodes = [node for node in g.nodes() if node not in diameter_node_set]
nx.draw_networkx_nodes(g, pos, nodelist=non_diameter_nodes, node_size=50, alpha=0.2)

# Draw non-diameter edges (the graph is undirected, so edge orientation must not matter)
non_diameter_edges = [edge for edge in g.edges() if frozenset(edge) not in diameter_edge_set]
nx.draw_networkx_edges(g, pos, edgelist=non_diameter_edges, width=1, alpha=0.2)

# Draw nodes in the diameter path with bigger size
nx.draw_networkx_nodes(g, pos, nodelist=diameter_path, node_size=200, node_color="red")

# Draw diameter path with thicker line width
nx.draw_networkx_edges(g, pos, edgelist=diameter_edges, edge_color='red', width=2)



plt.title(f"Graph Diameter: {diameter}")
plt.show()

### Most Influential Users

In this section, we'll employ various statistical measures to extract insights about our data, particularly focusing on identifying influential users. To achieve this, we will compute different centrality metrics, including degree centrality, betweenness centrality, eigenvector centrality, PageRank and closeness centrality, and list the top 5 users for each of them.

#### Degree Centrality

Degree centrality measures how connected a node is by counting its direct connections, akin to popularity in a social network, where individuals with more friends are considered more central.

In our case, a user with high degree centrality shares reviewed recipes with many other users, which usually means they have reviewed a large number of recipes. Such users are very active on the platform and may have a significant influence on which recipes others choose to try.

In [None]:
AMOUNT_USERS = 10

degree_centrality = nx.degree_centrality(g)
degree_centrality = sorted(degree_centrality.items(), key=lambda x: x[1], reverse=True)

# Build the table from the ranked list so every value stays aligned with its member
degree_centrality_members = pd.DataFrame({
    'member_id': [int(x[0]) for x in degree_centrality[0:5]],
    'number_of_connections': [g.degree(x[0]) for x in degree_centrality[0:5]],
    'degree_centrality': [x[1] for x in degree_centrality[0:5]],
})

# Add member details and number of reviews to the top users
degree_centrality_members = degree_centrality_members.merge(df_members[['member_id', 'member_name', 'follow_me_count']], on='member_id')
degree_centrality_members = degree_centrality_members.merge(df_reviews.groupby('member_id').size().reset_index(name='number_of_reviews'), on='member_id')

degree_centrality_members = degree_centrality_members[['member_id', 'member_name', 'number_of_reviews', 'follow_me_count', 'number_of_connections', 'degree_centrality']]
degree_centrality_members.head(AMOUNT_USERS)

#### Closeness Centrality

Closeness centrality quantifies how quickly a node can reach the others, similar to being centrally located in a city where any destination is close by. It is the reciprocal of the average shortest-path length from a node to all other nodes in the graph. In our case, a user with high closeness centrality can reach all other users through a short path of shared recipe reviews.


In [None]:
# closeness_centrality = nx.closeness_centrality(g, distance='weight')
# closeness_centrality = sorted(closeness_centrality.items(), key=lambda x: x[1], reverse=True)

# closeness_centrality_members = df_members[df_members['member_id'].isin([int(x[0]) for x in closeness_centrality[0:5]])]

# # Add number of reviews to the top users
# closeness_centrality_members = closeness_centrality_members.merge(df_reviews.groupby('member_id').size().reset_index(name='number_of_reviews'), on='member_id')

# closeness_centrality_members = closeness_centrality_members[['member_id', 'member_name', 'number_of_reviews', 'follow_me_count']]
# closeness_centrality_members['number_of_connections'] = [g.degree(x[0]) for x in closeness_centrality[0:5]]

# # Add closeness centrality to the DataFrame
# closeness_centrality_members['closeness_centrality'] = [x[1] for x in closeness_centrality[0:5]]

# closeness_centrality_members.head(AMOUNT_USERS)

#### Betweenness Centrality

Betweenness centrality evaluates the importance of a node in facilitating communication between other nodes, like a bridge that traffic must cross, which makes it crucial for connectivity. It is based on the fraction of shortest paths between all pairs of other nodes that pass through the node. In our case, a user with high betweenness centrality often acts as a bridge between groups of users who otherwise share few reviewed recipes.


In [None]:
# betweenness_centrality = nx.betweenness_centrality(g)
# betweenness_centrality = sorted(betweenness_centrality.items(), key=lambda x: x[1], reverse=True)

# betweenness_centrality_members = df_members[df_members['member_id'].isin([int(x[0]) for x in betweenness_centrality[0:5]])]

# # Add number of reviews to the top users
# betweenness_centrality_members = betweenness_centrality_members.merge(df_reviews.groupby('member_id').size().reset_index(name='number_of_reviews'), on='member_id')

# betweenness_centrality_members = betweenness_centrality_members[['member_id', 'member_name', 'number_of_reviews', 'follow_me_count']]
# betweenness_centrality_members['number_of_connections'] = [g.degree(x[0]) for x in betweenness_centrality[0:5]]

# # Add betweenness centrality to the DataFrame
# betweenness_centrality_members['betweenness_centrality'] = [x[1] for x in betweenness_centrality[0:5]]

# betweenness_centrality_members.head(AMOUNT_USERS)

#### Eigenvector Centrality

Eigenvector centrality assesses a node's importance by considering not only its direct connections but also the importance of its neighbors, similar to a ripple effect where influence spreads from influential nodes throughout the network. A node scores highly when it is linked to other nodes that themselves score highly. In our case, a user with high eigenvector centrality has reviewed many of the same recipes as other influential users.


In [None]:
eigen_centrality = nx.eigenvector_centrality(g, weight='weight')
eigen_centrality = sorted(eigen_centrality.items(), key=lambda x: x[1], reverse=True)

# Build the table from the ranked list so every value stays aligned with its member
eigen_centrality_members = pd.DataFrame({
    'member_id': [int(x[0]) for x in eigen_centrality[0:5]],
    'number_of_connections': [g.degree(x[0]) for x in eigen_centrality[0:5]],
    'eigen_centrality': [x[1] for x in eigen_centrality[0:5]],
})

# Add member details and number of reviews to the top users
eigen_centrality_members = eigen_centrality_members.merge(df_members[['member_id', 'member_name', 'follow_me_count']], on='member_id')
eigen_centrality_members = eigen_centrality_members.merge(df_reviews.groupby('member_id').size().reset_index(name='number_of_reviews'), on='member_id')

eigen_centrality_members = eigen_centrality_members[['member_id', 'member_name', 'number_of_reviews', 'follow_me_count', 'number_of_connections', 'eigen_centrality']]

eigen_centrality_members.head(AMOUNT_USERS)

#### Page Rank

The PageRank algorithm assesses a node's importance based on the quantity and quality of its connections. Originally used by Google to rank websites, it is similar to eigenvector centrality, but it adds a damping factor: the probability that a user keeps exploring reviews from users they are connected to, rather than jumping to a random user.
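
To illustrate the role of the damping factor, here is a minimal power-iteration sketch of PageRank on a small unweighted, undirected graph. The function `pagerank_sketch` and the toy graph are assumptions for illustration; it ignores edge weights and assumes every node has at least one neighbor, so it is a simplification of what `nx.pagerank` does in the cell below.

```python
def pagerank_sketch(adjacency, damping=0.85, iterations=100):
    """Power iteration for PageRank on an unweighted graph in which every node has at least one neighbor."""
    nodes = list(adjacency)
    n = len(nodes)
    rank = {node: 1 / n for node in nodes}
    for _ in range(iterations):
        # With probability (1 - damping) the walker jumps to a uniformly random node
        new_rank = {node: (1 - damping) / n for node in nodes}
        for node in nodes:
            # With probability `damping` the walker follows one of the node's edges
            share = damping * rank[node] / len(adjacency[node])
            for neighbor in adjacency[node]:
                new_rank[neighbor] += share
        rank = new_rank
    return rank

# Hypothetical toy graph: a star with center 0 and leaves 1..4
toy_star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
toy_ranks = pagerank_sketch(toy_star)
print(toy_ranks)
```

The scores sum to 1 and the center of the star receives the largest share, since every walk along an edge from a leaf leads to it.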

In [None]:
pagerank = nx.pagerank(g, weight='weight')
pagerank = sorted(pagerank.items(), key=lambda x: x[1], reverse=True)

# Build the table from the ranked list so every value stays aligned with its member
pagerank_members = pd.DataFrame({
    'member_id': [int(x[0]) for x in pagerank[0:5]],
    'number_of_connections': [g.degree(x[0]) for x in pagerank[0:5]],
    'pagerank': [x[1] for x in pagerank[0:5]],
})

# Add member details and number of reviews to the top users
pagerank_members = pagerank_members.merge(df_members[['member_id', 'member_name', 'follow_me_count']], on='member_id')
pagerank_members = pagerank_members.merge(df_reviews.groupby('member_id').size().reset_index(name='number_of_reviews'), on='member_id')

pagerank_members = pagerank_members[['member_id', 'member_name', 'number_of_reviews', 'follow_me_count', 'number_of_connections', 'pagerank']]

pagerank_members.head(AMOUNT_USERS)

### HITS (Hubs and Authorities)

The HITS algorithm, also known as hubs and authorities, gives each node two scores: an authority score, which grows with the hub scores of the nodes that point to it, and a hub score, which grows with the authority scores of the nodes it points to. This algorithm is particularly useful for identifying the most influential nodes in a network.

As we are working with an undirected graph, we will see equal values for both hubs and authorities, so we will only show the authorities.

In [None]:
# Run the HITS algorithm
hubs, authorities = nx.hits(g)

sorted_authorities = sorted(authorities.items(), key=lambda x: x[1], reverse=True)

# Build the table for the top 5 authorities from the ranked list so every value stays aligned with its member
authorities_members = pd.DataFrame({
    'member_id': [int(x[0]) for x in sorted_authorities[0:5]],
    'number_of_connections': [g.degree(x[0]) for x in sorted_authorities[0:5]],
    'authority_score': [x[1] for x in sorted_authorities[0:5]],
})

# Add member details and number of reviews to the top users
authorities_members = authorities_members.merge(df_members[['member_id', 'member_name', 'follow_me_count']], on='member_id')
authorities_members = authorities_members.merge(df_reviews.groupby('member_id').size().reset_index(name='number_of_reviews'), on='member_id')

authorities_members = authorities_members[['member_id', 'member_name', 'number_of_reviews', 'follow_me_count', 'number_of_connections', 'authority_score']]

In [None]:
authorities_members.head(AMOUNT_USERS)

We can see that some users are present in all the top 5 lists, which means they are very influential in the network. This is important to keep in mind when building the recommendation system, as these users may have a significant impact on the recommendations.

In [None]:
# Combine all the dataframes
all_metrics_df = pd.concat([
    degree_centrality_members, 
    # closeness_centrality_members, 
    # betweenness_centrality_members, 
    eigen_centrality_members, 
    pagerank_members,
    authorities_members
    ])

# Count the occurrences of each user
user_counts = all_metrics_df['member_id'].value_counts()
most_influential_users = user_counts[user_counts == user_counts.max()]

# Convert most_influential_users to DataFrame
most_influential_users_df = most_influential_users.reset_index()
most_influential_users_df.columns = ['member_id', 'count']

# Merge with df_members to get user names
most_influential_users_names = pd.merge(most_influential_users_df, df_members[['member_id', 'member_name', 'follow_me_count', 'member_joined']], on='member_id', how='left')

print(most_influential_users_names)

### Rich get richer

In this section, we will investigate the rich-get-richer phenomenon in our network. By analyzing the `member_joined` date, we expect to observe how the age of an account relates to its influence.

In [None]:
# Check the distribution of join dates: display the 25%, 50% and 75% quartiles
print(df_members['member_joined'].quantile([.25, .5, .75]))

This is a common behaviour in social networks: the longer a user has been in the network, the more connections they accumulate and the more influential they become.

### Will the reviews of the most influential users be on the top 10 recipes?

In [None]:
print('Top-Rated Recipes:')
print('------------------')

for index, recipe in top_rated_recipes.iterrows():
    print(f"{recipe['title']} (Recipe ID: {recipe['recipe_id']}) - Average Rating: {recipe['average_rating']:.2f} ({recipe['number_of_ratings']} reviews)")

In [None]:
# reviews of the most influential users
most_influential_users_reviews = df_reviews[df_reviews['member_id'].isin(most_influential_users_names['member_id'])]

# Find the recipes that match with top_rated_recipes and print their id, name, avg_rating and number of reviews
top_rated_recipes = top_rated_recipes[['recipe_id', 'title', 'average_rating', 'number_of_ratings']]
most_influential_users_reviews = most_influential_users_reviews.merge(top_rated_recipes, on='recipe_id')

most_influential_users_reviews[['recipe_id', 'title', 'average_rating', 'number_of_ratings']].head()

Yes! Our prediction was correct. We conclude that the most influential users have reviewed top-rated recipes, so their influence shows up in our network as expected.

### Community Detection

Now we will detect communities in our network. Communities are groups of nodes that are more densely connected to each other than to the rest of the network. Detecting communities can help us identify groups of users with similar tastes, which can be useful for making recommendations. We will assess the algorithm's performance by calculating a few metrics.

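#### Greedy Modularity Algorithm

Note that `greedy_modularity_communities` implements the Clauset-Newman-Moore greedy modularity maximization, not the Louvain method (NetworkX provides the latter separately as `louvain_communities`). The `louvain_` variable names and the "Louvain" labels in the following cells refer to this greedy modularity partition.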

In [None]:
louvain_communities = greedy_modularity_communities(g, weight='weight')
for i, community in enumerate(louvain_communities):
    print(f"Community {i + 1}: {len(community)}")

#### Label Propagation Algorithm

In [None]:
label_prop_communities = list(fast_label_propagation_communities(g, weight='weight'))
label_prop_communities = sorted(label_prop_communities, key=lambda x: len(x), reverse=True)

for i, community in enumerate(label_prop_communities):
    print(f"Community {i + 1}: {len(community)}")

#### Models evaluation

We will evaluate the performance of the models by calculating the modularity score. The modularity measures the strength of the division of the network into communities.
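
To make the score easier to interpret, the sketch below computes modularity by hand for an unweighted toy graph using `Q = sum over communities of L_c / m - (d_c / 2m)^2`, where `m` is the number of edges, `L_c` the number of edges inside community `c`, and `d_c` the sum of the degrees of its nodes. The function `modularity_score` and the toy graph are assumptions for illustration; the notebook itself relies on the `modularity` function from NetworkX, which also takes edge weights into account.

```python
def modularity_score(edges, communities):
    """Modularity of a partition of an unweighted, undirected graph given as an edge list."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    score = 0.0
    for community in communities:
        # Edges with both endpoints inside the community
        internal = sum(1 for u, v in edges if u in community and v in community)
        degree_sum = sum(degree.get(node, 0) for node in community)
        # Observed internal fraction minus the fraction expected at random
        score += internal / m - (degree_sum / (2 * m)) ** 2
    return score

# Hypothetical toy graph: two triangles joined by the single edge (2, 3)
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_triangles = [{0, 1, 2}, {3, 4, 5}]
one_block = [{0, 1, 2, 3, 4, 5}]

q_split = modularity_score(toy_edges, two_triangles)
q_single = modularity_score(toy_edges, one_block)
print(q_split, q_single)
```

Splitting the graph into its two triangles gives a clearly positive score, while putting every node in a single community gives 0, so a higher modularity indicates a more meaningful division.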

In [None]:
from networkx.algorithms.community.quality import modularity

louvain_modularity_score = modularity(g, louvain_communities) if len(louvain_communities) > 0 else 0
print("Louvain Modularity:", louvain_modularity_score)

label_modularity_score = modularity(g, label_prop_communities) if len(label_prop_communities) > 0 else 0
print("Label Propagation Modularity:", label_modularity_score)

communities = louvain_communities if louvain_modularity_score > label_modularity_score else label_prop_communities
print("Using communities from Louvain algorithm" if louvain_modularity_score > label_modularity_score else "Using communities from Label Propagation algorithm")

### Community Filtering
Here we remove the communities that have fewer users than the average community size.

In [None]:
# average number of users in a community
average_users = sum([len(x) for x in communities]) / len(communities)
print("Average Amount Users p/ Community: ", average_users)

filtered_communities = [c for c in communities if len(c) >= average_users]
for i, community in enumerate(filtered_communities):
    print(f"Community {i + 1}: {len(community)}")

### Visualizing the Communities

We will visualize the communities the algorithms found. Each community will have a different color for better visualization. They will be displayed, firstly, all together and then separated.

In [None]:
filtered_nodes = [node for community in filtered_communities for node in community]

# Create a subgraph with only the nodes in the filtered communities
sg = g.subgraph(filtered_nodes)

# Assign a color to each community
community_colors = [plt.cm.rainbow(i/len(filtered_communities)) for i in range(len(filtered_communities))]

# Map each node to the index of its community
node_community = {node: i for i, community in enumerate(filtered_communities) for node in community}

# Build the color list in the subgraph's own node order, which is the order networkx uses when drawing
colors = [community_colors[node_community[node]] for node in sg.nodes()]

# Draw the graph
plt.figure(figsize=(15, 15))
nx.draw_networkx(sg, with_labels=False, node_color=colors, node_size=50)
plt.show()

In [None]:
# Calculate the number of rows and columns for the grid
n = len(filtered_communities)
cols = int(np.ceil(np.sqrt(n)))
rows = int(np.ceil(n / cols))

# Create a figure and axes for the grid
fig, axs = plt.subplots(rows, cols, figsize=(15, 15), squeeze=False)  # squeeze=False keeps axs 2D even for a single row or column

# Assign a color to each community
# community_colors = [plt.cm.rainbow(i/n) for i in range(n)]

for i, community in enumerate(filtered_communities):
    # Create a subgraph with only the nodes in the current community
    sg = g.subgraph(community)

    # Create a list of colors, one for each node in the community
    colors = [community_colors[i]] * len(community)

    # Draw the graph on the corresponding axes
    ax = axs[i // cols, i % cols]
    ax.set_title(f"Community {i + 1}")
    nx.draw_networkx(sg, with_labels=False, node_color=colors, node_size=50, ax=ax)

# Remove empty subplots
for i in range(n, rows*cols):
    fig.delaxes(axs.flatten()[i])

plt.tight_layout()
plt.show()

## Recommender System

Now that we have identified communities of users with similar tastes, we can use this information to build a recommendation system. We will use the communities to make recommendations to users based on the recipes that other users in the same community have reviewed positively.

In [None]:
communities_rmse = {"Random Recommender":0, "User-Based CF":0, "Item-Based CF":0, "Model-Based CF":0, "Content-Based Filtering":0}
communities_mae = {"Random Recommender":0, "User-Based CF":0, "Item-Based CF":0, "Model-Based CF":0, "Content-Based Filtering":0}
communities_precision = {"Random Recommender":0, "User-Based CF":0, "Item-Based CF":0, "Model-Based CF":0, "Content-Based Filtering":0}
communities_recall = {"Random Recommender":0, "User-Based CF":0, "Item-Based CF":0, "Model-Based CF":0, "Content-Based Filtering":0}
whole_dataset_mae = {"Random Recommender":0, "User-Based CF":0, "Item-Based CF":0, "Model-Based CF":0, "Content-Based Filtering":0}
whole_dataset_rmse = {"Random Recommender":0, "User-Based CF":0, "Item-Based CF":0, "Model-Based CF":0, "Content-Based Filtering":0}
whole_dataset_precision = {"Random Recommender":0, "User-Based CF":0, "Item-Based CF":0, "Model-Based CF":0, "Content-Based Filtering":0}
whole_dataset_recall = {"Random Recommender":0, "User-Based CF":0, "Item-Based CF":0, "Model-Based CF":0, "Content-Based Filtering":0}

models_predictions = {}

filtered_users = [int(user) for sublist in filtered_communities for user in sublist]

df_members = df_members[df_members['member_id'].isin(filtered_users)]
df_reviews = df_reviews[df_reviews['member_id'].isin(filtered_users)]
df_recipes = df_recipes[df_recipes['recipe_id'].isin(df_reviews['recipe_id'])]

print("Shape of Filtered Members:", df_members.shape)
print("Shape of Filtered Reviews:", df_reviews.shape)
print("Shape of Filtered Recipes", df_recipes.shape)

### Collaborative Filtering (Applied @ each community)

In this section, we'll be exploring recommender systems that help suggest items based on similarities between users or items. We'll dive into both user-based and item-based collaborative filtering methods. Our aim is to apply these techniques to different communities, assess how well they work for each, and then gauge their overall performance by averaging the errors. 
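
Besides RMSE and MAE, the cells below report Precision@10 and Recall@10. The helper in `utils` is not shown in this notebook, so the following is only a sketch of how such metrics are commonly computed, under the assumptions that a rating of 4 or more counts as relevant and that predictions come as `(user_id, true_rating, estimated_rating)` tuples. The function name, the threshold and the toy data are all hypothetical.

```python
from collections import defaultdict

def precision_recall_at_k(predictions, k=10, threshold=4.0):
    """Return (mean precision@k, mean recall@k) over users for (user_id, true_rating, estimated_rating) tuples."""
    by_user = defaultdict(list)
    for user_id, true_rating, estimated in predictions:
        by_user[user_id].append((estimated, true_rating))

    precisions, recalls = [], []
    for ratings in by_user.values():
        # Rank the user's items by estimated rating, best first
        ratings.sort(key=lambda pair: pair[0], reverse=True)
        top_k = ratings[:k]
        relevant = sum(true >= threshold for _, true in ratings)
        recommended = sum(est >= threshold for est, _ in top_k)
        hits = sum(est >= threshold and true >= threshold for est, true in top_k)
        # Precision: share of recommended items that are relevant
        precisions.append(hits / recommended if recommended else 0.0)
        # Recall: share of relevant items that were recommended
        recalls.append(hits / relevant if relevant else 0.0)
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)

# Hypothetical toy predictions
toy_predictions = [
    ("u1", 5, 4.6), ("u1", 3, 4.2), ("u1", 4, 3.1),
    ("u2", 5, 4.9), ("u2", 2, 2.0),
]
precision, recall = precision_recall_at_k(toy_predictions, k=2)
print(precision, recall)
```

For `u1`, one of the two recommended recipes is relevant and one of the two relevant recipes is recommended, while `u2` is served perfectly, so both averages come out at 0.75.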

#### Memory-Based

##### User-Based

We will predict a user's preferences based on the preferences of similar users (users in the same community).

In [None]:
avg_rmse, avg_mae, avg_precision, avg_recall = ut.collaborative_filtering(df_reviews, filtered_communities, model_type='KNN')

In [None]:
print(f"\033[1m-----Overall Performance-----\033[0m")
print(f"\033[1mAverage RMSE ->\033[0m", avg_rmse)
print(f"\033[1mAverage MAE ->\033[0m", avg_mae)
print(f"\033[1mAverage Precision@10 ->\033[0m", avg_precision)
print(f"\033[1mAverage Recall@10 ->\033[0m", avg_recall)
print()    

communities_rmse["User-Based CF"] = avg_rmse
communities_mae["User-Based CF"] = avg_mae
communities_precision["User-Based CF"] = avg_precision
communities_recall["User-Based CF"] = avg_recall

##### Item-Based

This time we will use a recommendation approach that predicts a user's preferences by examining similarities between items rather than users.
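
The item-based variant can be sketched the same way: similarities are computed between recipes (using the users who rated both), and a user's rating for a recipe is estimated from that user's own ratings of similar recipes. Names, toy data, and the cosine similarity are illustrative assumptions, not the implementation behind `ut.collaborative_filtering`.

```python
import math
from collections import defaultdict

def invert(ratings):
    """Turn {user: {item: rating}} into {item: {user: rating}}."""
    by_item = defaultdict(dict)
    for user, items in ratings.items():
        for item, rating in items.items():
            by_item[item][user] = rating
    return dict(by_item)

def cosine_sim(vec_a, vec_b):
    """Cosine similarity over the keys (here: users) present in both vectors."""
    common = set(vec_a) & set(vec_b)
    if not common:
        return 0.0
    dot = sum(vec_a[key] * vec_b[key] for key in common)
    norm_a = math.sqrt(sum(vec_a[key] ** 2 for key in common))
    norm_b = math.sqrt(sum(vec_b[key] ** 2 for key in common))
    return dot / (norm_a * norm_b)

def predict_item_based(ratings, user, item, k=2):
    """Weighted mean of the user's ratings on the k items most similar to `item`."""
    by_item = invert(ratings)
    if item not in by_item:
        return None  # nobody rated this item yet
    neighbours = [
        (cosine_sim(by_item[item], by_item[other]), rating)
        for other, rating in ratings[user].items()
        if other != item
    ]
    neighbours = sorted(neighbours, reverse=True)[:k]
    total_sim = sum(sim for sim, _ in neighbours)
    if total_sim == 0:
        return None
    return sum(sim * rating for sim, rating in neighbours) / total_sim

# Hypothetical ratings: {user: {recipe: rating}}
toy_item_ratings = {
    "u1": {"r1": 5, "r2": 3},
    "u2": {"r1": 5, "r2": 3, "r3": 4},
    "u3": {"r1": 1, "r2": 5, "r3": 2},
}
item_based_pred = predict_item_based(toy_item_ratings, "u1", "r3")
print(item_based_pred)
```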

In [None]:
avg_rmse, avg_mae, avg_precision, avg_recall = ut.collaborative_filtering(df_reviews, filtered_communities, user_based=False, model_type='KNN')

In [None]:
print(f"\033[1m-----Overall Performance-----\033[0m")
print(f"\033[1mAverage RMSE ->\033[0m", avg_rmse)
print(f"\033[1mAverage MAE ->\033[0m", avg_mae)
print(f"\033[1mAverage Precision@10 ->\033[0m", avg_precision)
print(f"\033[1mAverage Recall@10 ->\033[0m", avg_recall)
print()    

communities_rmse["Item-Based CF"] = avg_rmse
communities_mae["Item-Based CF"] = avg_mae
communities_precision["Item-Based CF"] = avg_precision
communities_recall["Item-Based CF"] = avg_recall

Our graph is very sparse, so these models are not expected to perform best.

### Cold-Start Problem

In collaborative filtering approaches we can face the cold-start problem, where there may be insufficient data about users or items to make accurate recommendations.

#### Popularity model (Naive Approach)

To tackle the cold-start problem, we will use a popularity model. This model will recommend the most popular recipes to users who do not have enough reviews to make personalized recommendations.
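
The fallback logic described above could look like the sketch below. The function `recommend`, the `min_reviews` threshold, and all toy values are hypothetical; the notebook does not state what counts as "enough reviews".

```python
def recommend(user_id, review_counts, personalised, popular, min_reviews=5, n=10):
    """Return personalised recommendations when the user has enough reviews,
    otherwise fall back to the popularity list.

    `min_reviews` is an assumed threshold, chosen only for illustration.
    """
    has_history = review_counts.get(user_id, 0) >= min_reviews
    if has_history and user_id in personalised:
        return personalised[user_id][:n]
    return popular[:n]

# Hypothetical inputs
toy_review_counts = {"alice": 12, "bob": 1}
toy_personalised = {"alice": [101, 102]}
toy_popular = [7, 8, 9]

print(recommend("alice", toy_review_counts, toy_personalised, toy_popular))  # personalised list
print(recommend("bob", toy_review_counts, toy_personalised, toy_popular))    # popularity fallback
```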

In [None]:
# Keep recipes with more than 30 ratings, then sort them by "average_rating"
df_recipes_top = df_recipes[df_recipes['number_of_ratings'] > 30]
df_recipes_top = df_recipes_top.sort_values(by='average_rating', ascending=False)

# Get top N recommendations
top_n_popularity = df_recipes_top.head(10)

# Print the top N recommended items
print("\nTop Recommendations using Popularity Model:")
for index, recipe in top_n_popularity.iterrows():
    print(f"ID: {recipe['recipe_id']}, Title: {recipe['title']}, Average Rating: {recipe['average_rating']:.2f}, Number of Ratings: {recipe['number_of_ratings']}")

#### Model-Based

We will employ model-based collaborative filtering for personalized recommendations, contrasting with memory-based methods. Unlike memory-based approaches that directly compare user-item interactions, model-based methods utilize mathematical models to capture underlying patterns and relationships in the data.
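
As an illustration of what such a model learns, the sketch below trains a small biased matrix factorization with stochastic gradient descent, predicting a rating as the global mean plus user and item biases plus a dot product of latent factors. This is the general form used by SVD-style recommenders; the hyperparameters, function names, and toy data are assumptions and are not taken from the notebook or from `utils`.

```python
import random

def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit r_hat = mu + b_u + b_i + p_u . q_i by SGD on (user, item, rating) triples."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    bu = {u: 0.0 for u in users}
    bi = {i: 0.0 for i in items}
    # Small random initial factors
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    for _ in range(n_epochs):
        for u, i, r in ratings:
            pred = mu + bu[u] + bi[i] + sum(pf * qf for pf, qf in zip(p[u], q[i]))
            err = r - pred
            # Gradient steps with L2 regularisation
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                pf, qf = p[u][f], q[i][f]
                p[u][f] += lr * (err * qf - reg * pf)
                q[i][f] += lr * (err * pf - reg * qf)

    def predict(u, i):
        """Unknown users or items fall back to the terms that are available."""
        dot = sum(pf * qf for pf, qf in zip(p.get(u, []), q.get(i, [])))
        return mu + bu.get(u, 0.0) + bi.get(i, 0.0) + dot

    return predict

# Hypothetical (user, recipe, rating) triples
toy_triples = [
    ("u1", "r1", 5), ("u1", "r2", 3),
    ("u2", "r1", 4), ("u2", "r3", 2),
    ("u3", "r2", 4), ("u3", "r3", 1),
]
predict_rating = train_mf(toy_triples)
print(predict_rating("u3", "r1"))  # estimate for a pair that was never rated
```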

In [None]:
avg_rmse, avg_mae, avg_precision, avg_recall = ut.collaborative_filtering(df_reviews, filtered_communities, model_type='SVD')

In [None]:
print(f"\033[1m-----Overall Performance-----\033[0m")
print(f"\033[1mAverage RMSE ->\033[0m", avg_rmse)
print(f"\033[1mAverage MAE ->\033[0m", avg_mae)
print(f"\033[1mAverage Precision@10 ->\033[0m", avg_precision)
print(f"\033[1mAverage Recall@10 ->\033[0m", avg_recall)
print()    

communities_rmse["Model-Based CF"] = avg_rmse
communities_mae["Model-Based CF"] = avg_mae
communities_precision["Model-Based CF"] = avg_precision
communities_recall["Model-Based CF"] = avg_recall

#### Content-based Filtering 

In this section, we'll be exploring recommender systems that suggest items based on similarities between the characteristics of items. We'll delve into content-based filtering methods, which recommend items to users based on the similarity of the items' features or attributes. Our aim is to apply these techniques to different communities, assess how well they work for each community, and then evaluate their overall performance.

By vectorizing text-based features, we can find similar recipes.
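
A minimal sketch of this idea, assuming a plain bag-of-words representation and cosine similarity: each recipe's text becomes a vector of word counts, and two recipes are similar when their vectors point in a similar direction. The actual vectorizer used by `ut.find_similars` is not shown in this notebook and may differ (for example TF-IDF); the recipe texts below are invented.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words vector: word -> number of occurrences."""
    return Counter(text.lower().split())

def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical recipe descriptions
toy_texts = {
    1: "chickpea tahini lemon garlic",
    2: "chickpea tahini garlic olive oil",
    3: "chocolate butter sugar flour",
}
toy_vectors = {rid: vectorize(text) for rid, text in toy_texts.items()}

sim_1_2 = cosine(toy_vectors[1], toy_vectors[2])  # three shared words
sim_1_3 = cosine(toy_vectors[1], toy_vectors[3])  # no shared words
print(round(sim_1_2, 2), sim_1_3)
```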

In [None]:
all_recommendations = ut.find_similars(df_reviews, df_recipes)

for recipe_id, recipe_data in all_recommendations.items():
    print(f"\033[1mOriginal Recipe: {recipe_data['original_title']} (Recipe ID: {recipe_id})\033[0m")
    print("Similar Recipes:")
    unique_similar_recipe_ids = set()  # Track unique similar recipe IDs for each original recipe
    for similar_recipe in recipe_data['similar_recipes']:
        if similar_recipe['id'] not in unique_similar_recipe_ids:
            print(f"- {similar_recipe['title']} (Recipe ID: {similar_recipe['id']}) | Similarity Score: {similar_recipe['score']:.2f}")
            unique_similar_recipe_ids.add(similar_recipe['id'])
    print()

In [None]:
df_similar_recipes = ut.create_similar_recipes_dataframe(all_recommendations)
print(df_similar_recipes)

In [None]:
avg_similar_recipes_per_recipe, avg_similar_score = ut.calculate_average_similarity(df_similar_recipes)
print(f"\033[1mAverage Similar Recipes per Recipe:\033[0m {avg_similar_recipes_per_recipe:.2f}")
print(f"\033[1mAverage Similarity Score:\033[0m {avg_similar_score:.2f}")

As can be seen, there are few pairs of similar recipes, which should translate into a small number of recommendations.

In [None]:
avg_precision, avg_recall = ut.content_based_filtering(df_reviews, df_similar_recipes, filtered_communities)
print(f"\033[1mAverage Precision ->\033[0m", avg_precision)
print(f"\033[1mAverage Recall ->\033[0m", avg_recall)

Despite a high similarity between the ratings users give to similar recipes, it is rare for the same user to review similar recipes.

##### Content-based User Profiles 

In [None]:
user_profiles = ut.user_profiles(filtered_communities, df_reviews, df_recipes, df_members)
print(user_profiles)

After building user profiles, we can analyze each user's favourite ingredients.
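
One simple way to derive favourite ingredients from a profile is sketched below: every ingredient accumulates the ratings of the recipes it appears in, and the highest-scoring ingredients are the favourites. The profile structure and helper names are assumptions; the format returned by `ut.user_profiles` is not documented here.

```python
from collections import defaultdict

def build_profile(user_reviews, recipe_ingredients):
    """Sum the ratings a user gave to the recipes containing each ingredient."""
    profile = defaultdict(float)
    for recipe_id, rating in user_reviews:
        for ingredient in recipe_ingredients.get(recipe_id, []):
            profile[ingredient] += rating
    return dict(profile)

def favourite_ingredients(profile, n=3):
    """Top-n ingredients by accumulated rating (ties broken alphabetically)."""
    return sorted(profile, key=lambda ing: (-profile[ing], ing))[:n]

# Hypothetical data: (recipe_id, rating) pairs and recipe -> ingredients
toy_reviews = [(1, 5), (2, 4), (3, 1)]
toy_ingredients = {
    1: ["chickpea", "tahini"],
    2: ["chickpea", "lemon"],
    3: ["sugar"],
}
toy_profile = build_profile(toy_reviews, toy_ingredients)
print(favourite_ingredients(toy_profile, n=2))
```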

In [None]:
top_ingredients = ut.get_top_favorite_ingredients(user_profiles, 813262) 
print(top_ingredients)

We can also employ the same strategy to find each community's favourite ingredients.

In [None]:
top_favorite_per_community = ut.get_top_favorite_ingredients_per_community(user_profiles, filtered_communities)
for community, top_ingredients in top_favorite_per_community.items():
    print("Top favorite ingredients for Community", community, ":")
    print(top_ingredients)
    print()

With this data we can now find out the favourite recipes of each community and rank them.

In [None]:
# Assuming df_recipes contains the recipes DataFrame and top_favorite_per_community is the dictionary containing top favorite ingredients for each community
recommendations_per_community = ut.community_recipe_recommendations(df_recipes, top_favorite_per_community)
for community, recommendations in recommendations_per_community.items():
    print("Recommendations for Community", community, ":")
    print(recommendations)
    print()

These favourite recipes can be recommended to the users of that same community, constituting a community-profile content-based approach.
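
Recommendation lists like these are commonly scored with Precision@k and Recall@k. The sketch below shows one usual definition; what counts as a "relevant" recipe (for example, any recipe the user reviewed, or only those rated above a threshold) is an assumption, and `ut.evaluate_recommendations` may define it differently.

```python
def precision_recall_at_k(recommended, relevant, k=10):
    """Precision@k and Recall@k for one user or community.

    recommended: ranked list of recipe ids
    relevant:    set of recipe ids considered relevant (assumed definition)
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example: 2 of the 4 recommended recipes are relevant, out of 3 relevant overall
toy_precision, toy_recall = precision_recall_at_k([1, 2, 3, 4], {2, 4, 9}, k=4)
print(toy_precision, toy_recall)
```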

In [None]:
precision_at_10, recall_at_10 = ut.evaluate_recommendations(recommendations_per_community, df_reviews, filtered_communities)
print("Average Precision@10:", precision_at_10)
print("Average Recall@10:", recall_at_10)

### Random Recommender

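As a baseline for comparison, we will also use a random recommender. Surprise's `NormalPredictor` predicts each rating by sampling from a normal distribution fitted to the training ratings, so it ignores user preferences entirely and gives a lower bound that the other models should beat.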
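
To make the baseline concrete, the sketch below draws predictions from a normal distribution whose mean and standard deviation are estimated from the training ratings, then clips them to the 1 to 5 scale. This mirrors the documented behaviour of Surprise's `NormalPredictor`; the helper names and toy ratings are hypothetical.

```python
import random
import statistics

def fit_normal(train_ratings):
    """Estimate the mean and (population) standard deviation of the training ratings."""
    return statistics.mean(train_ratings), statistics.pstdev(train_ratings)

def predict_random(mu, sigma, rng, low=1, high=5):
    """Draw a rating from N(mu, sigma^2), clipped to the rating scale."""
    return min(high, max(low, rng.gauss(mu, sigma)))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
toy_mu, toy_sigma = fit_normal([5, 4, 4, 3, 5, 1])
toy_random_preds = [predict_random(toy_mu, toy_sigma, rng) for _ in range(100)]
print(toy_mu, toy_sigma, toy_random_preds[:3])
```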

In [None]:
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df_reviews[['member_id', 'recipe_id', 'rating']], reader)
trainset, testset = train_test_split(data)

In [None]:
random_algo = NormalPredictor()
rmse, mae, predictions, precision, recall = ut.evaluate_model(random_algo, trainset, testset)

print(f"\033[1m-----Overall Performance-----\033[0m")
print(f"\033[1mAverage RMSE ->\033[0m", rmse)
print(f"\033[1mAverage MAE ->\033[0m", mae)
print(f"\033[1mAverage Precision@10 ->\033[0m", precision)
print(f"\033[1mAverage Recall@10 ->\033[0m", recall)
print()    

whole_dataset_rmse["Random Recommender"] = rmse
whole_dataset_mae["Random Recommender"] = mae
whole_dataset_precision["Random Recommender"] = precision
whole_dataset_recall["Random Recommender"] = recall
models_predictions["Random Recommender"] = predictions

In [None]:
# run random for communities
avg_rmse, avg_mae, avg_precision, avg_recall = ut.collaborative_filtering(df_reviews, filtered_communities, model_type='Random')
communities_rmse["Random Recommender"] = avg_rmse
communities_mae["Random Recommender"] = avg_mae
communities_precision["Random Recommender"] = avg_precision
communities_recall["Random Recommender"] = avg_recall

### Collaborative Filtering (Applied @ whole data)

#### Memory-Based

##### User-Based

In [None]:
ubcf_algo = KNNBasic(sim_options={'user_based': True}, verbose=False)
rmse, mae, pred_user, precision, recall = ut.evaluate_model(ubcf_algo, trainset, testset)

print(f"\033[1m-----Overall Performance-----\033[0m")
print(f"\033[1mAverage RMSE ->\033[0m", rmse)
print(f"\033[1mAverage MAE ->\033[0m", mae)
print(f"\033[1mAverage Precision@10 ->\033[0m", precision)
print(f"\033[1mAverage Recall@10 ->\033[0m", recall)
print()    

whole_dataset_rmse["User-Based CF"] = rmse
whole_dataset_mae["User-Based CF"] = mae
whole_dataset_precision["User-Based CF"] = precision
whole_dataset_recall["User-Based CF"] = recall
models_predictions["User-Based CF"] = pred_user

##### Item-Based

In [None]:
ibcf_algo = KNNBasic(sim_options={'user_based': False}, verbose=False)
rmse, mae, pred_item, precision, recall = ut.evaluate_model(ibcf_algo, trainset, testset)

print(f"\033[1m-----Overall Performance-----\033[0m")
print(f"\033[1mAverage RMSE ->\033[0m", rmse)
print(f"\033[1mAverage MAE ->\033[0m", mae)
print(f"\033[1mAverage Precision@10 ->\033[0m", precision)
print(f"\033[1mAverage Recall@10 ->\033[0m", recall)
print()    

whole_dataset_rmse["Item-Based CF"] = rmse
whole_dataset_mae["Item-Based CF"] = mae
whole_dataset_precision["Item-Based CF"] = precision
whole_dataset_recall["Item-Based CF"] = recall
models_predictions["Item-Based CF"] = pred_item

#### Model-Based

In [None]:
svd_algo = SVD(verbose = False)
rmse, mae, pred_model, precision, recall = ut.evaluate_model(svd_algo, trainset, testset)

print(f"\033[1m-----Overall Performance-----\033[0m")
print(f"\033[1mAverage RMSE ->\033[0m", rmse)
print(f"\033[1mAverage MAE ->\033[0m", mae)
print(f"\033[1mAverage Precision@10 ->\033[0m", precision)
print(f"\033[1mAverage Recall@10 ->\033[0m", recall)
print()  

whole_dataset_rmse["Model-Based CF"] = rmse
whole_dataset_mae["Model-Based CF"] = mae
whole_dataset_precision["Model-Based CF"] = precision
whole_dataset_recall["Model-Based CF"] = recall
models_predictions["Model-Based CF"] = pred_model

### Content-Based Filtering (Applied @ whole data)

In [None]:
avg_precision, avg_recall = ut.overall_content_based_filtering(df_reviews, df_similar_recipes, filtered_communities)

print(f"\033[1m-----Overall Performance-----\033[0m")
print(f"\033[1mAverage Precision ->\033[0m", avg_precision)
print(f"\033[1mAverage Recall ->\033[0m", avg_recall)
print()   

### Evaluation

In [None]:
# RMSE, MAE, Precision and Recall values for each model, per community and on the whole dataset
# Models
models = ["Random Recommender", 
          "User-Based CF", 
          "Item-Based CF",
          "Model-Based CF", 
          "Content-Based Filtering"
          ]
indices = np.arange(len(models))

# Collect the metric values for each model, in plotting order
communities_rmse_values = [communities_rmse[model] for model in models]
communities_mae_values = [communities_mae[model] for model in models]
communities_precision_values = [communities_precision[model] for model in models]
communities_recall_values = [communities_recall[model] for model in models]
whole_dataset_rmse_values = [whole_dataset_rmse[model] for model in models]
whole_dataset_mae_values = [whole_dataset_mae[model] for model in models]
whole_dataset_precision_values = [whole_dataset_precision[model] for model in models]
whole_dataset_recall_values = [whole_dataset_recall[model] for model in models]


# Create one subplot per metric (RMSE, MAE, Precision, Recall)
fig, axes = plt.subplots(nrows=4, ncols=1, figsize=(12, 25))
bar_width = 0.4


# Plot RMSE values
for i, value in enumerate(communities_rmse_values):
    axes[0].text(i - bar_width/2, value - 0.01, str(round(value, 3)), ha='center', va='top')
    
for i, value in enumerate(whole_dataset_rmse_values):
    axes[0].text(i + bar_width/2, value - 0.01, str(round(value, 3)), ha='center', va='top')
    
axes[0].bar(indices - 0.2, communities_rmse_values, width=bar_width, color='lightsalmon', alpha=0.6, label='Communities RMSE')
axes[0].bar(indices + 0.2, whole_dataset_rmse_values, width=bar_width, color='wheat', alpha=0.6, label='Whole Dataset RMSE')
axes[0].set_xticks(indices)
axes[0].set_xticklabels(models, rotation=30)
axes[0].set_ylabel('RMSE')
axes[0].set_title('RMSE Values')
axes[0].legend()

axes[0].set_ylim(0, max(communities_rmse_values + whole_dataset_rmse_values) + 0.3)

# Plot MAE values
for i, value in enumerate(communities_mae_values):
    axes[1].text(i - bar_width/2, value - 0.01, str(round(value, 3)), ha='center', va='top')
    
for i, value in enumerate(whole_dataset_mae_values):
    axes[1].text(i + bar_width/2, value - 0.01, str(round(value, 3)), ha='center', va='top')
    
axes[1].bar(indices - 0.2, communities_mae_values, width=bar_width, color='cornflowerblue', alpha=0.6, label='Communities MAE')
axes[1].bar(indices + 0.2, whole_dataset_mae_values, width=bar_width, color='seagreen', alpha=0.6, label='Whole Dataset MAE')
axes[1].set_xticks(indices)
axes[1].set_xticklabels(models, rotation=30)
axes[1].set_ylabel('MAE')
axes[1].set_title('MAE Values')
axes[1].legend()

axes[1].set_ylim(0, max(communities_mae_values + whole_dataset_mae_values) + 0.3)

# Plot Precision values
for i, value in enumerate(communities_precision_values):
    axes[2].text(i - bar_width/2, value - 0.01, str(round(value, 3)), ha='center', va='top')

for i, value in enumerate(whole_dataset_precision_values):
    axes[2].text(i + bar_width/2, value - 0.01, str(round(value, 3)), ha='center', va='top')

axes[2].bar(indices - 0.2, communities_precision_values, width=bar_width, color='lightsalmon', alpha=0.6, label='Communities Precision')
axes[2].bar(indices + 0.2, whole_dataset_precision_values, width=bar_width, color='wheat', alpha=0.6, label='Whole Dataset Precision')
axes[2].set_xticks(indices)
axes[2].set_xticklabels(models, rotation=30)
axes[2].set_ylabel('Precision')
axes[2].set_title('Precision Values')
axes[2].legend()

axes[2].set_ylim(0, max(communities_precision_values + whole_dataset_precision_values) + 0.3)

# Plot Recall values
for i, value in enumerate(communities_recall_values):
    axes[3].text(i - bar_width/2, value - 0.01, str(round(value, 3)), ha='center', va='top')
    
for i, value in enumerate(whole_dataset_recall_values):
    axes[3].text(i + bar_width/2, value - 0.01, str(round(value, 3)), ha='center', va='top')
    
axes[3].bar(indices - 0.2, communities_recall_values, width=bar_width, color='cornflowerblue', alpha=0.6, label='Communities Recall')
axes[3].bar(indices + 0.2, whole_dataset_recall_values, width=bar_width, color='seagreen', alpha=0.6, label='Whole Dataset Recall')
axes[3].set_xticks(indices)
axes[3].set_xticklabels(models, rotation=30)
axes[3].set_ylabel('Recall')
axes[3].set_title('Recall Values')
axes[3].legend()

axes[3].set_ylim(0, max(communities_recall_values + whole_dataset_recall_values) + 0.3)

# Adjust spacing between subplots
plt.subplots_adjust(hspace=0.5)

plt.show()

### Conclusions

After analysing the results of the different models, we conclude that the best model for our dataset is model-based collaborative filtering. It achieved the lowest RMSE and MAE values, indicating that it predicts most accurately the ratings users give to the recipes they have reviewed.

This was expected as our dataset is very sparse, and the memory-based models are not expected to perform well in this scenario. The content-based model also performed well, but it is limited by the features available in the dataset, and it may not be able to capture all the nuances of the users' tastes.