Counterfactual explanations work by systematically removing some items from the user's
previous interactions and then querying the recommendation system again to check whether
the item to be explained has disappeared from the user's suggestions. If it has, the set
of removed items responsible for that change is the explanation provided to the user.
The result is an explanation of the form: "If you had not liked item A, then item B
would not have been suggested," where A is the set of items removed from the user's
feedback and B is the item the user wanted an explanation for.
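
The remove-and-recheck loop described above can be sketched as follows. This is only a minimal illustration under stated assumptions: `toy_recommend` and its `SUPPORT` table are hypothetical stand-ins for a real recommender, and the brute-force search over subsets of increasing size is one possible strategy, not necessarily the one intended by the assignment.

```python
from itertools import combinations

# Hypothetical toy model: each candidate item is "supported" by some liked items.
SUPPORT = {
    "B": {"A1", "A2"},
    "C": {"A3"},
    "D": {"A1", "A3", "A4"},
}

def toy_recommend(history, k=2):
    """Rank unseen items by how many items in the history support them."""
    scores = {item: len(sup & history)
              for item, sup in SUPPORT.items() if item not in history}
    ranked = sorted((i for i in scores if scores[i] > 0),
                    key=lambda i: (-scores[i], i))
    return ranked[:k]

def counterfactual(history, target, recommend, max_size=3):
    """Return a smallest set of items whose removal drops `target`, or None."""
    if target not in recommend(history):
        return None  # nothing to explain
    for size in range(1, max_size + 1):
        for removed in combinations(sorted(history), size):
            # Re-run the recommender on the reduced history
            if target not in recommend(history - set(removed)):
                return set(removed)
    return None

history = {"A1", "A2", "A3", "A4"}
explanation = counterfactual(history, "B", toy_recommend)
print(f"If you had not liked {sorted(explanation)}, B would not have been suggested.")
```

Searching by increasing subset size returns a smallest explanation first, but the number of subsets grows quickly, so a real implementation would likely need a heuristic to prune candidates.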

In this part of the project, the goal is to produce counterfactual explanations for a group
of users. Specifically, design (10 points) and implement (10 points) a method that
generates counterfactual explanations with characteristics that make them better suited
to group recommendations. For example, an explanation that consists only of items that a
single user has interacted with is undesirable, since it would single out that user in
front of the rest of the group. Ideally, explanations should consist of items that most
users have interacted with, which makes the explanation fairer. In this context, fairness
means that no single user should have to change their preferences for the group's sake;
instead, the group as a whole is responsible for the suggestions provided by the system.
Also prepare a short presentation (about 5 slides) to show how your method works (5 points).
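
One way to encode the fairness requirement above is to measure, for each item, the share of group members who interacted with it, and to prefer explanations built from high-coverage items. The sketch below is a hedged illustration only: `toy_group_recommend`, `SUPPORT`, `item_coverage`, `fairness`, and the `min_coverage` threshold are all hypothetical names and choices introduced here, not part of the assignment.

```python
from itertools import combinations

# Hypothetical toy model: items that support each candidate recommendation.
SUPPORT = {"T": {"A1", "A4"}, "C": {"A3"}}

def toy_group_recommend(group_history, k=1):
    """Aggregate support over all members and return the top-k items."""
    scores = {item: sum(len(sup & items) for items in group_history.values())
              for item, sup in SUPPORT.items()}
    ranked = sorted((i for i in scores if scores[i] > 0),
                    key=lambda i: (-scores[i], i))
    return ranked[:k]

def item_coverage(item, group_history):
    """Fraction of group members who interacted with `item`."""
    return sum(item in items for items in group_history.values()) / len(group_history)

def fairness(explanation, group_history):
    """Mean coverage of the items in an explanation (1.0 = shared by everyone)."""
    return sum(item_coverage(i, group_history) for i in explanation) / len(explanation)

def group_counterfactual(group_history, target, recommend, max_size=2, min_coverage=0.5):
    """Smallest, fairest set of widely shared items whose removal drops `target`."""
    if target not in recommend(group_history):
        return None
    all_items = set().union(*group_history.values())
    # Only items seen by enough members may appear in an explanation
    candidates = sorted(i for i in all_items
                        if item_coverage(i, group_history) >= min_coverage)
    for size in range(1, max_size + 1):
        valid = []
        for removed in combinations(candidates, size):
            reduced = {u: items - set(removed) for u, items in group_history.items()}
            if target not in recommend(reduced):
                valid.append(set(removed))
        if valid:
            return max(valid, key=lambda e: fairness(e, group_history))
    return None

group = {"u1": {"A1", "A3", "A4"}, "u2": {"A1", "A3"}, "u3": {"A1", "A3"}}
result = group_counterfactual(group, "T", toy_group_recommend)
print(result, fairness(result, group))
```

In this toy group, removing `A4` alone would also drop `T`, but only `u1` interacted with it, so the coverage filter excludes it and the explanation falls back to `A1`, which every member shares.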

In [24]:
import pandas as pd
import numpy as np
from collections import defaultdict
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
from sklearn.metrics import jaccard_score
from scipy.spatial.distance import pdist, squareform
from itertools import combinations
from datetime import datetime

In [25]:
links = pd.read_csv('ml-latest-small/links.csv')
movies = pd.read_csv('ml-latest-small/movies.csv')
ratings_df = pd.read_csv('ml-latest-small/ratings.csv')
tags = pd.read_csv('ml-latest-small/tags.csv')

print("Links Dataset:")
display(links.head())

print("\nMovies Dataset:")
display(movies.head())

print("\nRatings Dataset:")
display(ratings_df.head())

print("\nTags Dataset:")
display(tags.head())

rating_count = ratings_df.shape[0]
print(f"\nTotal number of ratings: {rating_count}")

Links Dataset:


,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0



Movies Dataset:


,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy



Ratings Dataset:


,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931



Tags Dataset:


,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200



Total number of ratings: 100836


In [29]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Create item and user sets
item_set = set(ratings_df['movieId'].unique())
user_set = set(ratings_df['userId'].unique())

# Rating scores in a dictionary format for fast lookup
rating_scores = ratings_df.groupby('userId').apply(
    lambda df: df[['movieId', 'rating']].set_index('movieId')['rating'].to_dict(), include_groups=False
).to_dict()

# Function to calculate cosine similarity between all pairs of users
def calculate_cosine_similarity(all_ratings):
    # Convert ratings to a user-item matrix (unrated items are filled with 0)
    user_item_matrix = all_ratings.pivot(index='userId', columns='movieId', values='rating').fillna(0)
    # Compute the user-by-user cosine similarity matrix
    cosine_sim_matrix = cosine_similarity(user_item_matrix)
    return pd.DataFrame(cosine_sim_matrix, index=user_item_matrix.index, columns=user_item_matrix.index)

# Get peers based on cosine similarity with a threshold
similarity_df = calculate_cosine_similarity(ratings_df)
user = 1
threshold = 0.8
peers = similarity_df.loc[user][similarity_df.loc[user] > threshold].index.tolist()
peers.remove(user)  # Remove the user themselves from the peers list

# Define Moderate and Active Users
moderate_users = ratings_df['userId'].value_counts().loc[lambda x: (x >= 45) & (x <= 55)].index.tolist()
active_users = ratings_df['userId'].value_counts().loc[lambda x: (x >= 145) & (x <= 155)].index.tolist()

# Define movie sets based on popularity
movie_popularity = ratings_df['movieId'].value_counts()
movies2k = movie_popularity.nsmallest(100).index.tolist()
movies4k = movie_popularity[movie_popularity.between(2000, 4000)].nlargest(100).index.tolist()
movies6k = movie_popularity[movie_popularity.between(4000, 6000)].nlargest(100).index.tolist()
movies8k = movie_popularity.nlargest(100).index.tolist()

# Example relevance score function (dummy, replace with actual model output)
relevance_score_fn = lambda u, i: 3.5  # Dummy value

# WNCF function as previously defined
def WNCF(item_set, user_set, user, why_not_question, rating_scores, recommendation_list, threshold_numP, peers, relevance_score_fn):
    # Initialize an explanation list
    e = []
    item = why_not_question[0]

    # Step 1: Check if the item is in the item set
    if item not in item_set:
        e.append('I')

    # Step 2: Check for ties with items already in the recommendation list
    elif any(relevance_score_fn(user, item) == relevance_score_fn(user, i0)
             for i0 in recommendation_list):
        e.append('Tie')

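    # Step 3: Check if item is within the first 2k entries of an expanded recommendation list
    # (this slice never truncates, so the check is only meaningful if recommendation_list
    # already holds the expanded 2k-item ranking rather than just the top-k)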
    elif item in recommendation_list[:2 * len(recommendation_list)]:
        e.append('k')

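    # Step 4: Check if the item has no ratings in the score list
    # (rating_scores is keyed by userId, so look inside each user's ratings)
    elif not any(item in user_ratings for user_ratings in rating_scores.values()):
        e.append('S')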

    # Step 5: Check if at least one peer of the user has rated the item
    elif any(item in rating_scores.get(peer, {}) for peer in peers):
        rated_peers = [peer for peer in peers if item in rating_scores.get(peer, {})]
        for peer in rated_peers:
            e.append((peer, rating_scores[peer][item], similarity_df.loc[user, peer]))
        # Too few peers rated the item
        if len(rated_peers) < threshold_numP:
            e.append('numP')
        # Too few peers overall
        if len(peers) < threshold_numP:
            e.append('numPI')

    # Step 6: Check if any user has rated the item
    else:
        for other_user in user_set:
            if item in rating_scores.get(other_user, {}):
                e.append((other_user, rating_scores[other_user][item], '-'))
        e.append('Peers')

    # Return the explanation
    return e

# Example usage
user = 1
recommendation_list = [1, 2, 3]  # Example movie IDs
threshold_numP = 3
why_not_question = (4,)  # Movie ID not in recommendation list

explanation = WNCF(item_set, user_set, user, why_not_question, rating_scores, recommendation_list, threshold_numP, peers, relevance_score_fn)
print(explanation)




['Tie']
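
The example prints `['Tie']` because the placeholder `relevance_score_fn` returns the same score for every item, so the tie check in Step 2 always fires. A non-constant relevance function could look like the sketch below, which predicts a score as the similarity-weighted average of the peers' ratings. This is an assumption about what "actual model output" might be, not the notebook's model; `make_relevance_score_fn` and the toy `similarity` dictionary are hypothetical names used only for illustration.

```python
def make_relevance_score_fn(rating_scores, similarity, peers):
    """Build a relevance function: similarity-weighted mean of peer ratings."""
    def relevance(user, item):
        num = den = 0.0
        for peer in peers:
            rating = rating_scores.get(peer, {}).get(item)
            if rating is not None:
                weight = similarity[user][peer]
                num += weight * rating
                den += abs(weight)
        # No peer rated the item: fall back to 0.0 (an arbitrary choice)
        return num / den if den else 0.0
    return relevance

# Toy data in the same shape as rating_scores: {userId: {movieId: rating}}
toy_ratings = {1: {10: 4.0}, 2: {20: 5.0, 30: 3.0}, 3: {20: 3.0}}
toy_similarity = {1: {2: 0.9, 3: 0.6}}

score = make_relevance_score_fn(toy_ratings, toy_similarity, peers=[2, 3])
print(score(1, 20), score(1, 30), score(1, 99))
```

With real data, the returned function has the same `(user, item)` signature as the dummy lambda, so it could be passed to `WNCF` in its place.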
