<h1>Assignment 4</h1>

In [1]:
from dataclasses import dataclass
from typing import List, Optional

import numpy as np
import pandas as pd

ZIP_FILE_LINK = 'https://files.grouplens.org/datasets/movielens/ml-latest-small.zip'

<h4>Loading data</h4>

In [2]:
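# Download the ZIP archive and extract its CSV files into the working directory
def fetch_data():
    import requests, zipfile, io
    r = requests.get(ZIP_FILE_LINK)
    r.raise_for_status()  # Fail early on HTTP errors instead of unzipping an error page
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall()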

# Fetch the dataset only if it has not been extracted yet
import os
if not os.path.exists('ml-latest-small/ratings.csv'):
    print('Fetching data...')
    fetch_data()

# Read CSV files
ratings = pd.read_csv('ml-latest-small/ratings.csv')
movies = pd.read_csv('ml-latest-small/movies.csv')
links = pd.read_csv('ml-latest-small/links.csv')
tags = pd.read_csv('ml-latest-small/tags.csv')

In [3]:
ratings.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


<p>First, we reuse the helper functions for predicting movie ratings from the previous assignment and prepare the data.</p>
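
For reference, the prediction computed by `predict_rating` in the next cell can be summarised as follows. The notation is ours and is only meant as a reading aid: $\bar{r}_u$ is the mean rating of user $u$, $U_i$ is the set of users who rated movie $i$, and $\mathrm{sim}(u, v)$ is the Pearson correlation over co-rated movies, clipped at zero and set to zero when fewer than `min_common_items` movies are shared.

$$
\hat{r}_{u,i} = \min\left(5,\ \max\left(0,\ \bar{r}_u + \frac{\sum_{v \in U_i} \mathrm{sim}(u, v)\,\left(r_{v,i} - \bar{r}_v\right)}{\sum_{v \in U_i} \mathrm{sim}(u, v)}\right)\right)
$$

If user $u$ has already rated movie $i$, the known rating is returned instead, and if the denominator is zero the prediction falls back to $\bar{r}_u$.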

In [4]:
# Helper functions from the previous assignment

# min_common_percentage = 0.1
min_common_items = 2

# Implementation of Pearson correlation
def get_similarity_between_two_items(user1, user2):
    # Find common items
    common_items = user1.notna() & user2.notna()
    
    common_items_count = common_items.sum()
    if common_items_count == 0:
        return 0  # No common items, no correlation

    # Approach 2: Common items count
    if common_items_count < min_common_items:
        return 0  # Not enough common items for a meaningful correlation
    
    # Get the common items
    user1_common = user1[common_items]
    user2_common = user2[common_items]
    
    # Pearson correlation requires at least 2 common items
    if len(user1_common) < 2:
        return 0 
    
    # Calculate the Pearson correlation coefficient
    correlation = user1_common.corr(user2_common)
    
    if np.isnan(correlation):
        return 0  # Handle NaN values
    
    return max(correlation, 0)  # Return a non-negative correlation

def predict_rating(user_id, movie_id, ratings_by_users):
    # If the user has already rated the movie, return the known rating
    if not np.isnan(ratings_by_users.loc[user_id, movie_id]):
        return ratings_by_users.loc[user_id, movie_id]

    # Get the users who rated the movie
    users_who_rated = ratings_by_users[ratings_by_users[movie_id].notna()].index

    # Calculate the similarities and the weighted ratings
    similarities = [get_similarity_between_two_items(ratings_by_users.loc[user_id], ratings_by_users.loc[other_user_id]) for other_user_id in users_who_rated]
    weighted_ratings = [similarity * (ratings_by_users.loc[other_user_id, movie_id] - ratings_by_users.loc[other_user_id].mean()) for other_user_id, similarity in zip(users_who_rated, similarities)]

    # If no user who rated the movie is similar to this user, fall back to the user's mean rating
    if sum(similarities) == 0:
        return ratings_by_users.loc[user_id].mean()

    # Return the weighted average rating, ensuring it is within the range of 0 - 5
    return max(min((sum(weighted_ratings) / sum(similarities)) + ratings_by_users.loc[user_id].mean(), 5), 0)


In [5]:
# Copying the ratings dataframe to a new dataframe for further processing
movie_ratings = ratings.copy()

In [6]:
# Making a pivot table to get the ratings of each movie by each user
ratings_by_users = movie_ratings.pivot_table(index='userId', columns='movieId', values='rating', aggfunc='first')
ratings_by_users.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,,,,,,,,,,,,,,,,,,,,,
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,


<p>Implementing the average aggregation and least misery approaches for a set of users</p>
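
To make the difference between the two strategies concrete, here is a minimal sketch on made-up data. The `toy_ratings` table and its values are hypothetical and do not come from MovieLens; only the `mean`/`min` aggregation mirrors what the notebook does.

```python
import pandas as pd

# Hypothetical predicted ratings: rows are group members, columns are movies
toy_ratings = pd.DataFrame(
    {"A": [5.0, 4.0, 1.0], "B": [3.5, 3.5, 3.5], "C": [4.0, 2.0, 3.0]},
    index=[1, 2, 3],
)

# Average aggregation: mean of the members' ratings per movie
average_scores = toy_ratings.mean(axis=0).sort_values(ascending=False)

# Least misery: the lowest rating any member gives per movie
least_misery_scores = toy_ratings.min(axis=0).sort_values(ascending=False)

print(average_scores)
print(least_misery_scores)
```

In this toy group, movie A ranks second under averaging but last under least misery, because member 3 rates it 1.0.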

In [7]:
def group_rating_prediction(userIds, all_users_ratings, strategy='average'):
    # Get only the data about the users we are interested in
    group_users_ratings = all_users_ratings.loc[userIds]

    # Remove the movies that no one has rated
    group_users_ratings = group_users_ratings.dropna(axis=1, how='all')

    # Dataset to use for prediction
    dataset_for_prediction = all_users_ratings.copy()

    # Predict the individual ratings of the movies that the users haven't rated
    for user_id in userIds:
        for movie_id in group_users_ratings.columns:
            group_users_ratings.loc[user_id, movie_id] = predict_rating(user_id, movie_id, dataset_for_prediction)
    
    if strategy == 'average':
        # Aggregate by the mean of the members' ratings for every movie
        movie_ratings_aggregated = group_users_ratings.mean(axis=0)
    elif strategy == 'least_misery':
        # Aggregate by the lowest rating any member gives to every movie
        movie_ratings_aggregated = group_users_ratings.min(axis=0)
    else:
        raise ValueError(f"Unknown strategy: {strategy!r}")

    # Sort the movies by their aggregated rating, best first
    movie_ratings_aggregated = movie_ratings_aggregated.sort_values(ascending=False)

    return movie_ratings_aggregated, group_users_ratings


**Designing and implementing** methods that produce explanations for group recommendations, covering the granularity case for both atomic questions (e.g., Why not Matrix?) and group questions (e.g., Why not action movies?), as well as the position absenteeism case (e.g., Why not rank Matrix first?)

Our design is heavily based on the given paper (https://homepages.tuni.fi/konstantinos.stefanidis/docs/wise20.pdf), which presents methods for explaining why-not questions for single users. We adapt these methods to explain group recommendations for **granularity** and **position absenteeism**. In contrast to the paper, which also handles dependent questions, we only explain independent why-not questions, because this assignment only requires the independent cases (Why not Matrix?, Why not action movies?, Why not rank Matrix first?).

As in the paper, we distinguish between **general explanations**, which can appear in any recommender system, and **model-specific explanations** that are based on the inherent parameters of the CF group recommendation model.

**General explanations:** 

1. The most common question concerns an item that is not in the database. In that case the explanation is evident: *the item is not suggested to the group because it is unknown to the system*. The same holds in the group case, where the question is about a category (a list of movies) and no movie from the list is in the database: *This list of movies (category) is unknown to the system*.

2. The item is in the system but has not been rated by any user, so the explanation is: *it is not recommended to the group because nobody in the database rated this item*. The same holds in the group case when no movie from the list was rated: *This category (list of movies) was not rated by any user in the database*.

3. Similarly, when no user from the group rated the movie or movies in question, we answer: *No user from the group rated this movie/category*. The item is therefore not recommended, because we designed the system to recommend a movie only if at least one user in the group has rated it (this explanation borders on being model-specific).

4. Another explanation stems from the number of returned top-k items: a movie may be missing from the group's recommendations simply because k is small. We therefore check whether the item appears in the extended recommendation list (the top 2k items); if it ranks lower than that, we do not consider k to be the cause. If the item is among the top 2k items, we explain: *You asked for few items, item i is in the n-th position*.
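
The top-2k check in the last point can be sketched as follows. This is only an illustration under our own assumptions: the helper `position_in_extended_list` and the `scores` series are hypothetical names and values, not part of the notebook.

```python
from typing import Optional

import pandas as pd


def position_in_extended_list(ranked: pd.Series, item, k: int) -> Optional[int]:
    """Return the 1-based rank of `item` if it lies within the top 2k, else None."""
    extended = ranked.sort_values(ascending=False).head(2 * k)
    if item not in extended.index:
        return None
    return extended.index.get_loc(item) + 1


# Hypothetical aggregated group scores indexed by movie id
scores = pd.Series({10: 4.9, 20: 4.7, 30: 4.1, 40: 3.8, 50: 3.2})

print(position_in_extended_list(scores, 30, k=2))  # inside the top 4
print(position_in_extended_list(scores, 50, k=2))  # outside the top 4
```

With k = 2, item 30 is found in the extended list, so the "you asked for few items" explanation would apply, whereas item 50 falls through to the model-specific explanations.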

If none of these cases applies, we provide a model-specific explanation.

**Model specific explanations:**

The idea behind CF group recommendations is that the system suggests items to a group based on what the group members have liked in the past. All possible explanations therefore revolve around the ratings of the group members and the strategy used to aggregate these ratings. We provide explanations for the average and least misery strategies.

An answer to a "Why not item A?" question can thus be derived by analyzing the group members' ratings in light of the strategy, that is, from the list of (user_id, score) tuples and the given strategy.

1. If no group member gave the item a high score, we explain: *No one from the group liked the movie*.

2. If the strategy is average, the item was not recommended because its average rating is not high enough, which means some group members did not like it very much. We explain: *n users from the group liked the movie, but m users disliked it*, and report the average rating. When the question is about a category (list of movies), we first average the scores over the movies for each user and then explain: *n users from the group liked these movies (category), but m users disliked them*.

3. If the strategy is least misery, the reason is simple in both the atomic and the group case: one person in the group did not like the movie/category. We explain: *user i did not like this movie/category, he/she gave score x*.
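
As a small worked example of these three cases, the sketch below summarises a list of (user_id, score) tuples. The scores are hypothetical, and the 3.5 cut-off is the like threshold used in the implementation further down.

```python
# Hypothetical (user_id, score) tuples for one movie
member_scores = [(1, 4.5), (2, 4.0), (3, 2.5)]
LIKE_THRESHOLD = 3.5  # same cut-off as the notebook's like threshold

# Split the group into members who liked and disliked the movie
liked = [user for user, score in member_scores if score >= LIKE_THRESHOLD]
disliked = [user for user, score in member_scores if score < LIKE_THRESHOLD]

# Average strategy: report the counts and the mean score
average = sum(score for _, score in member_scores) / len(member_scores)

# Least misery strategy: report the member with the lowest score
worst_user, worst_score = min(member_scores, key=lambda pair: pair[1])

print(f"{len(liked)} liked, {len(disliked)} disliked, average {average:.2f}")
print(f"Least misery: user {worst_user} gave {worst_score}")
```

Under the average strategy the answer would mention two likes, one dislike, and the mean score; under least misery it would point to user 3.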


As in the paper, when a user questions an item’s ranking in the recommendation list, the system checks the same information. It answers questions such as “Why was item A not ranked higher?” by explaining the item’s statistics: how many group members liked the movie and how many disliked it. This type of question is vague in the sense that the user questions the general ranking of an item without comparing it to another item; it is an independent question. The system therefore treats it as a total absenteeism question. Since dependent questions are not part of the assignment, the system gives the same answer to a position absenteeism question as to a total absenteeism question.


In [8]:
@dataclass
class WhyNotQuestion:
    m: int # Item id
    pos: Optional[int] = None # Rank of the item in the recommendation list
    # dependency is not needed for this assignment
    # d: Optional[int] = None # The item id that m is dependent on

@dataclass
class ModelSpecificExplanations:
    group_member_id: int # The group member id
    score: float # The score of the group member for the item in question


def modelexplanation_to_string(explanations: List[ModelSpecificExplanations], strategy: str = 'average') -> str:
    # threshold for liking a movie
    like_treshold = 3.5

    # Count the number of likes, dislikes, average rating and the user with the lowest rating
    like_count = 0
    rat_sum = 0

    min_rating = 5
    min_rating_user = None

    for explanation in explanations:
        rat_sum += explanation.score
        if explanation.score >= like_treshold:
            like_count += 1
        if explanation.score < min_rating:
            min_rating = explanation.score
            min_rating_user = explanation.group_member_id
    
    if like_count == 0:
        return f"No one from the group liked the movie."
    elif strategy == 'average':
        return f"{like_count} people from the group liked and {len(explanations) - like_count} people disliked the movie.\nThe average rating is {rat_sum / len(explanations)}"
    elif strategy == 'least_misery':
        return f"User {min_rating_user} gave rating {min_rating} to the movie."


def group_modelexplanation_to_string(explanations: List[ModelSpecificExplanations], strategy='average') -> str:
    # threshold for liking a movie
    like_treshold = 3.5

    # get average rating by user
    user_ratings = {}
    for explanation in explanations:
        if explanation.group_member_id not in user_ratings:
            user_ratings[explanation.group_member_id] = []
        user_ratings[explanation.group_member_id].append(explanation.score)
    user_ratings_average = {k: sum(v) / len(v) for k, v in user_ratings.items()}

    # Count the number of likes, dislikes, average rating and the user with the lowest rating
    like_count = 0
    rat_sum = 0

    min_rating = 5
    min_rating_user = None

    for user, rating in user_ratings_average.items():
        rat_sum += rating
        if rating >= like_treshold:
            like_count += 1
        if rating < min_rating:
            min_rating = rating
            min_rating_user = user

    if like_count == 0:
        return f"No one from the group liked these movies."
    elif strategy == 'average':
        return f"{like_count} people from the group like and {len(user_ratings_average) - like_count} people dislike these movies.\nThe average rating is {rat_sum / len(user_ratings_average)}"
    elif strategy == 'least_misery':
        return f"User {min_rating_user} does not like these movies, his/her average rating is {min_rating}"


In [17]:
def wncf_group(item_set: set,
               group_members: List[int],
               why_not_questions: List[WhyNotQuestion],
               rating_scores: pd.DataFrame,
               group_rating_scores: pd.DataFrame,
               expanded_recommendation_list_for_group: pd.Series,
               strategy: str) -> str:
    if len(why_not_questions) == 0:
        return 'No questions to answer'
    elif len(why_not_questions) > 1:
        # item group case
        # Only items that exist in the database can be looked up in rating_scores
        known_questions = [wn for wn in why_not_questions if wn.m in item_set]
        # None of the items exist in the database
        if not known_questions:
            return 'None of the movies exist in the database'
        # None of the items have any rating that is not NaN
        elif all(rating_scores[wn.m].isna().all() for wn in known_questions):
            return 'None of the movies have ratings'
        # None of the users from the group have rated any of the items
        elif all(rating_scores.loc[group_members, wn.m].isna().all() for wn in known_questions):
            return 'None of the group members have rated any of these movies'
        else:
            # model specific answers
            explanations_for_group: List[ModelSpecificExplanations] = []
            for wn in why_not_questions:
                for group_member in group_members:
                    try:
                        explanations_for_group.append(ModelSpecificExplanations(group_member, group_rating_scores.loc[group_member, wn.m]))
                    except KeyError:
                        print(f'Noboby from the group has rated item {wn.m}')
                        break
            return group_modelexplanation_to_string(explanations_for_group, strategy)
    else:
        wn = why_not_questions[0]
        # Check if the item exists in the database
        if wn.m not in item_set:
            return f'Item {wn.m} does not exist in the database'
        # Check if the item has any rating that is not NaN
        elif rating_scores[wn.m].isna().all():
            return f'Item {wn.m} has no ratings'
        # Check if users from the group have rated the item
        elif rating_scores.loc[group_members, wn.m].isna().all():
            return f'None of the group members have rated item {wn.m}'
        elif wn.m in expanded_recommendation_list_for_group.index and wn.pos is None:
            return f'You asked for few items, {wn.m} is at position {expanded_recommendation_list_for_group.index.get_loc(wn.m) + 1}'
        else:
            # model specific answers
            explanations_for_group: List[ModelSpecificExplanations] = []
            for group_member in group_members:
                # Get the score of the group member for the item in question
                explanations_for_group.append(ModelSpecificExplanations(group_member, group_rating_scores.loc[group_member, wn.m]))
            return modelexplanation_to_string(explanations_for_group, strategy)


In [10]:
userIds = [1, 2, 3]
top_n = 10

In [11]:
expanded_recommendation_list_average, group_users_ratings_average = group_rating_prediction(userIds, ratings_by_users, strategy='average')
expanded_recommendation_list_least_misery, group_users_ratings_least_misery = group_rating_prediction(userIds, ratings_by_users, strategy='least_misery')

expanded_recommendation_list_average = expanded_recommendation_list_average.head(2*top_n)
expanded_recommendation_list_least_misery = expanded_recommendation_list_least_misery.head(2*top_n)

In [12]:
print("Average strategy recommendation list:")
print(expanded_recommendation_list_average)

Average strategy recommendation list:
movieId
70946    5.000000
3703     4.993518
1587     4.797554
5746     4.649425
5919     4.649425
6835     4.649425
5181     4.649425
2288     4.644323
7899     4.482759
5764     4.482759
4518     4.475006
101      4.466118
2502     4.371382
1222     4.358639
3441     4.352742
1275     4.349734
2851     4.318411
2529     4.316131
2959     4.313345
47       4.280672
dtype: float64


In [13]:
print("\nLeast misery strategy recommendation list:")
print(expanded_recommendation_list_least_misery)


Least misery strategy recommendation list:
movieId
70946    5.000000
3703     4.980555
1587     4.500000
4518     4.026903
2288     4.000000
2851     3.948276
6835     3.948276
5181     3.948276
5919     3.948276
5764     3.948276
5746     3.948276
7899     3.948276
3168     3.746670
3024     3.707828
1732     3.617876
441      3.568187
1275     3.500000
2028     3.494685
3441     3.479912
2648     3.471668
dtype: float64


In [14]:
# granularity atomic case

# Why not item 1?
why_not_question = WhyNotQuestion(m=1)
print(f'Why not item {why_not_question.m}?\n')

print(f'Average strategy: {wncf_group(set(ratings_by_users.columns), userIds, [why_not_question], ratings_by_users, group_users_ratings_average, expanded_recommendation_list_average, strategy="average")}\n')
print(f'Least misery strategy: {wncf_group(set(ratings_by_users.columns), userIds, [why_not_question], ratings_by_users, group_users_ratings_least_misery, expanded_recommendation_list_least_misery, strategy="least_misery")}\n')

Why not item 1?

Average strategy: 2 people from the group liked and 1 people disliked the movie.
The average rating is 3.6316621469985857

Least misery strategy: User 3 gave rating 2.7382175186880544 to the movie.



In [15]:
# granularity group case

# Why not items 1, 2 and 3?
why_not_questions = [WhyNotQuestion(m=1), WhyNotQuestion(m=2), WhyNotQuestion(m=3)]
print(f'Why not items {[wn.m for wn in why_not_questions]}?\n')

print(f'Average strategy: {wncf_group(set(ratings_by_users.columns), userIds, why_not_questions, ratings_by_users, group_users_ratings_average, expanded_recommendation_list_average, strategy="average")}\n')
print(f'Least misery strategy: {wncf_group(set(ratings_by_users.columns), userIds, why_not_questions, ratings_by_users, group_users_ratings_least_misery, expanded_recommendation_list_least_misery, strategy="least_misery")}\n')

Why not items [1, 2, 3]?

Noboby from the group has rated item 2
Average strategy: 2 people from the group like and 1 people dislike these movies.
The average rating is 3.4127635963326735

Noboby from the group has rated item 2
Least misery strategy: User 3 does not like these movies, his/her average rating is 2.3081684562125053



In [18]:
# position absenteeism

# Why not item 101 at position 1?
why_not_question = WhyNotQuestion(m=101, pos=1)
print(f'Why not item {why_not_question.m} at position {why_not_question.pos}?\n')

print(f'Average strategy: {wncf_group(set(ratings_by_users.columns), userIds, [why_not_question], ratings_by_users, group_users_ratings_average, expanded_recommendation_list_average, strategy="average")}\n')
print(f'Least misery strategy: {wncf_group(set(ratings_by_users.columns), userIds, [why_not_question], ratings_by_users, group_users_ratings_least_misery, expanded_recommendation_list_least_misery, strategy="least_misery")}\n')


Why not item 101 at position 1?

Average strategy: 2 people from the group liked and 1 people disliked the movie.
The average rating is 4.466117579052866

Least misery strategy: User 3 gave rating 3.3983527371585986 to the movie.

