
*UE Learning from User-generated Data, CP MMS, JKU Linz 2025*
# Exercise 1: RecSys Basics II
In this exercise we familiarize ourselves with recommender systems and the kind of data they use, and implement a simple baseline recommendation algorithm.

The assignment submission deadline is 11.03.2025 12:00.
Please, don't forget to rename your Jupyter Notebook according to the convention:<br>

LUD25_ex01_**k**<font color='red'>12312127</font>_<font color='red'>Kaufmann-Lukas</font>.ipynb

for example:

LUD25_ex01_**k**0000007_Bond-James.ipynb

## Introduction
* What are recommender systems?
* Where do we encounter them?
* What part does User-generated Data play in RecSys?

## Recommendation Scenario
Imagine a platform where users consume items: buy goods (Amazon), listen to music tracks (Deezer, Spotify), watch movies (Netflix) or videos (YouTube).

At some point a user may face a choice: "what item should I have a look at next?" They may not know exactly what they need and be unable to formulate a query. And with catalogs of millions of items, they have little chance of finding something useful by browsing through all of them.

In such situations Recommender Systems are expected to make the decision easier for the user by shrinking the scope to a handful of individually selected options, for example the top 10 recommended songs.

Information the recommendation can be based on:
* Items already consumed by the user
* Items consumed by other users
* User relations
* Item meta-data & content
* ...

User-Item interactions are one of the most widely used signals in recommendation. Initially they may be available in the form of system logs (see the table below). There is a multitude of ways a user can interact with an item: consume (buy, watch, listen), note (like, save to favorites), share and others. In this exercise we only deal with item consumption.

#### Example: Raw User-Item Interactions Data
| Meaningless but Unique<br>Event Id | User Id | Item Id | Event Type | Date |
| ---         |---  |--- |---   |   ---    |
| 002Ax4gf... | 12  | 2  | 6000 | 13.04.08 |
| 9f2D4jKx... | 908 | 2  | 6000 | 01.02.09 |
| 3g6lP89qs.. | 12  | 13 | 4800 | 11.10.10 |
| ...         | ... |... | ...  | ...      |
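As a rough illustration of how such a log could be condensed into per-pair consumption counts, here is a minimal sketch. The column names, the fourth row and the assumption that every row is a consumption event are made up for this example; they are not taken from the actual dataset files.

```python
import pandas as pd

# Hypothetical raw log loosely mirroring the table above (illustrative names/values).
raw_log = pd.DataFrame({
    "user_id": [12, 908, 12, 12],
    "item_id": [2, 2, 13, 2],
    "date": ["13.04.08", "01.02.09", "11.10.10", "12.10.10"],
})

# Count how many times each user consumed each item.
counts = (raw_log.groupby(["user_id", "item_id"])
                 .size()
                 .reset_index(name="n_events"))
print(counts)
```

Each row of `counts` then describes one User-Item pair, which is the aggregated form used in the rest of this exercise.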

## Datasets
Throughout the whole exercise track we will be mostly working on music & movie recommendation tasks. Note that all methods we consider are applicable to other domains!

[LFM-2b](http://www.cp.jku.at/datasets/LFM-2b/) is a large dataset of over two billion listening events spanning ~15 years, crawled from the LastFM platform. It comes with user demographic information and music track meta-data. In this exercise we take a look at a small sample of the aggregated LFM-2b (lfm-tiny) as well as the MovieLens-1M dataset (ml-1m). Each of them consists of three files. For lfm-tiny these are:

* 'lfm-tiny.inter' - data about user-track interactions;
* 'lfm-tiny.item' - track-related information;
* 'lfm-tiny.user' - user-related information;
    
And for ml-1m respectively:
    
* 'ml-1m.inter' - data about user-movie ratings;
* 'ml-1m.item' - movie-related information;
* 'ml-1m.user' - user-related information;

### Important note:
**'lfm-tiny.inter'** contains the cumulative number of listening events per User-Track pair over the whole period;

**'ml-1m.inter'** contains **ratings** per User-Movie pair on a scale from 1 to 5;

**The interpretation of the interaction feedback (listening events or ratings) is usually up to the designer of a recommender system. In this course we treat any feedback as IMPLICIT FEEDBACK**.

## Implicit feedback
The MovieLens dataset provides us with ratings users give to movies. This allows us to judge whether a user (dis)likes one movie more than another.

However, we do not always have explicit information about whether a user likes or dislikes a certain item (and to what extent). In the case of LFM-2b we only know how many times a user has interacted with a certain item. A single interaction with a track does not mean that the user enjoyed it, so how many interactions do we need to be sure?

Following the concept of **Implicit Feedback** we binarise our interaction data: to every **User**-**Item** pair we assign 1 if the user interacted with the item sufficiently (enough times, or with a sufficiently high rating), **or** 0 if the user did not interact with the item, did not enjoy it, or the preference is unknown.
    
Very roughly speaking, recommendation with implicit feedback is a binary classification problem: predicting whether the user is going to interact with an item or not.
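The binarisation step can be sketched on a tiny made-up count matrix. The numbers and the threshold of 2 are arbitrary choices for illustration only.

```python
import numpy as np

# Hypothetical listening counts: rows are users, columns are items.
counts = np.array([[0, 1, 5],
                   [3, 0, 2]])
threshold = 2  # assumed value, chosen only for this example

# 1 where the user interacted at least `threshold` times, 0 otherwise.
implicit = (counts >= threshold).astype(np.int8)
print(implicit)
```

With this threshold, the single play by the first user no longer counts as a valid interaction.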

## <font color='red'>TASKS</font>:

Implement the functions specified below. Please don't change the signatures (names, parameters), and follow the specifications closely. Your implementation should not require any additional imports apart from those already in the notebook.

For testing purposes, make sure the two dataset folders are placed in the same folder as the .ipynb file.

In [3]:
import pandas as pd
import numpy as np

## TASK 1/3: Interaction Matrix (4 points)
An interaction matrix is a common data structure used in some (not all) recommender algorithms. It is a matrix with dimensions [number of users] times [number of items] known to the system. Every element in the matrix shows whether the given User ever interacted with the given Item. This can be encoded in a binary manner, as a probability or as a rating given by the User to the Item.

**Write a function** that creates an interaction matrix from both the ml-1m and lfm-tiny datasets. It receives three dataframes, the dataset name and an int threshold value as input and returns a 2-dimensional numpy array with the corresponding interaction matrix. In it, **0** means the user didn't deliberately interact with the item or didn't like it (played the track \< [threshold] times, or gave a rating \< [threshold] in the case of ml-1m), and **1** means the user listened to the track at least [threshold] times (or gave a rating greater than or equal to [threshold] in the case of ml-1m).

The first dimension of the matrix should correspond to users, second - to items.

**Important note:** we introduce the threshold as a way to filter out interactions that are not necessarily meaningful (e.g. accidental playbacks or movies a user disliked).
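To make the expected output format concrete, here is a minimal sketch on made-up toy data. The column name `count` and the assumption that user and item ids already run from 0 to n-1 are simplifications for this example and need not hold for the real files.

```python
import numpy as np
import pandas as pd

# Toy interactions in long format; names and values are illustrative only.
toy_inter = pd.DataFrame({
    "user_id": [0, 0, 1, 2],
    "item_id": [1, 2, 0, 2],
    "count":   [4, 1, 2, 3],
})
n_users, n_items, threshold = 3, 3, 2

# Drop interactions below the threshold.
kept = toy_inter[toy_inter["count"] >= threshold]

# Rows are users, columns are items; set all surviving cells at once.
matrix = np.zeros((n_users, n_items), dtype=np.int8)
matrix[kept["user_id"].to_numpy(), kept["item_id"].to_numpy()] = 1
print(matrix)
```

The pair (user 0, item 2) has only one event, so it stays 0 in the resulting matrix.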

Insert your solution into the signature below. Please, don't change the name or the argument set, even if they are not pretty.

In [None]:
def inter_matr_implicit(users: pd.DataFrame,
                        items: pd.DataFrame,
                        interactions: pd.DataFrame,
                        dataset_name: str,
                        threshold=1) -> np.ndarray:
    '''
    users - pandas Dataframe, use it as loaded from the dataset;
    items - pandas Dataframe, use it as loaded from the dataset;
    interactions - pandas Dataframe, use it as loaded from the dataset;
    dataset_name - string out of ["lfm-tiny", "ml-1m"], name of the dataset, used in case there are differences in the column names of the data frames;
    threshold - int > 0, criteria of a valid interaction

    returns - 2D np.ndarray, rows - users, columns - items;
    '''


    if dataset_name == "lfm-tiny":
        user_col = "user_id"
        item_col = "item_id"
        interaction_col = "listening_events"
    elif dataset_name == "ml-1m":
        user_col = "user_id"
        item_col = "item_id"  
        interaction_col = "rating"
    else:
        raise ValueError("Dataset name must be 'lfm-tiny' or 'ml-1m'")
    
    # Build user and item mappings (id -> row/column index)
    user_index = pd.Index(users[user_col].unique())
    item_index = pd.Index(items[item_col].unique())

    # Initialise the interaction matrix with zeros
    interaction_matrix = np.zeros((len(user_index), len(item_index)), dtype=np.int8)

    # Keep only interactions at or above the threshold
    valid = interactions[interactions[interaction_col] >= threshold]

    # Translate ids to matrix indices (-1 marks ids missing from users/items)
    user_idx = user_index.get_indexer(valid[user_col])
    item_idx = item_index.get_indexer(valid[item_col])
    known = (user_idx >= 0) & (item_idx >= 0)

    # Mark all remaining (user, item) pairs at once instead of looping over rows
    interaction_matrix[user_idx[known], item_idx[known]] = 1  # binary interaction

    return interaction_matrix



In [6]:
# load the data for both datasets, keep it as specified in the csv files
def read(dataset, file):
    return pd.read_csv(dataset + '/' + dataset + '.' + file, sep='\t')

users_lfm = read("lfm-tiny", 'user')
items_lfm = read("lfm-tiny", 'item')
interactions_lfm = read("lfm-tiny", 'inter')

users_ml = read("ml-1m", 'user')
items_ml = read("ml-1m", 'item')
interactions_ml = read("ml-1m", 'inter')

### Check your solution:
Run your function on the data discussed above and make sure that the result is correct.

In [8]:
# Creates interaction matrix for LFM dataset, choose the correct threshold for this dataset
_interaction_matrix_test_lfm = inter_matr_implicit(users_lfm, items_lfm, interactions_lfm, "lfm-tiny", threshold=1)

In [9]:
# Test your solution, assert will print a message if something is wrong, no message if everything is correct
assert _interaction_matrix_test_lfm is not None, "Interaction Matrix should not be None!"
assert type(_interaction_matrix_test_lfm) == np.ndarray, "Interaction Matrix should be a numpy array!"
assert _interaction_matrix_test_lfm.shape == (1194, 412), "Shape of Interaction Matrix is wrong!"
assert np.array_equal(np.unique(_interaction_matrix_test_lfm),
                      [0, 1]), "Interaction Matrix should only contain 0 and 1!"

In [10]:
# Creates interaction matrix for Movielens dataset, choose the correct threshold for this dataset
_interaction_matrix_test_ml = inter_matr_implicit(users_ml, items_ml, interactions_ml, "ml-1m", threshold=1)

In [11]:
# Test your solution, assert will print a message if something is wrong, no message if everything is correct
assert _interaction_matrix_test_ml is not None, "Interaction Matrix should not be None!"
assert type(_interaction_matrix_test_ml) == np.ndarray, "Interaction Matrix should be a numpy array!"
assert _interaction_matrix_test_ml.shape == (6040, 3883), "Shape of Interaction Matrix is wrong!"
assert np.array_equal(np.unique(_interaction_matrix_test_ml), [0, 1]), "Interaction Matrix should only contain 0 and 1!"

## TASK 2/3: POP Recommender (4 points)
One of the most straightforward approaches to recommendation is recommending the most popular items to every user. We call such a recommender POP. It is a useful (and quite strong) baseline for more sophisticated systems and can serve as a default recommender when there is no data available to base the recommendation on (for example, if the user has just joined the platform and hasn't interacted with anything yet). Throughout the whole exercise track we only recommend items **not seen** by the user before (repeated consumption is out of our scope).

**Write a function** that recommends [K] most popular items to a given user, **making sure that the user hasn't seen any of the recommended items before.**

The function should take three arguments: an np.ndarray of arbitrary size (supporting any number of users and items) in the format from Task 1 (interaction matrix), a user ID (int) and K (int > 0).
Expected return: a list or a 1D array with the [K] IDs of the most popular items (sorted in order of descending popularity) **not seen** by the user.
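The idea can be sketched on a small made-up matrix. The values are chosen so that no two items are equally popular, which sidesteps the question of how ties should be broken.

```python
import numpy as np

# Toy interaction matrix: 4 users (rows) x 4 items (columns), illustrative values.
toy = np.array([[1, 0, 0, 0],
                [1, 1, 0, 1],
                [1, 1, 0, 1],
                [1, 1, 1, 0]])
user = 0

# Popularity = number of users who interacted with each item: [4, 3, 1, 2].
popularity = toy.sum(axis=0)

# Item indices ordered from most to least popular.
ranking = np.argsort(-popularity, kind="stable")

# Keep only items the user has not interacted with, preserving the ranking order.
unseen = ranking[toy[user, ranking] == 0]
top_2 = unseen[:2]
print(top_2)
```

Item 0 is the most popular overall, but user 0 has already seen it, so the recommendation starts with item 1.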

Insert your solution into the signature below. Please, don't change the name or the argument set, even if they are not beautiful.

In [13]:
def recTopKPop(inter_matr: np.ndarray,
               user: int,
               top_k: int) -> np.array:
    '''
    inter_matr - np.ndarray, from the task 1;
    user - int, user_id;
    top_k - int, expected length of the resulting list;

    returns - list/array, of top K popular items that the user has never seen
              (sorted in the order of descending popularity);
    '''

    top_pop = None

    # Compute item popularity by summing interactions across all users
    item_popularity = np.sum(inter_matr, axis=0)
    
    # Get indices of items sorted by popularity in descending order
    sorted_items = np.argsort(item_popularity)[::-1]
    
    # Get items the user has already interacted with
    seen_items = set(np.where(inter_matr[user] > 0)[0])
    
    # Filter out seen items and select the top K popular ones
    top_pop = [item for item in sorted_items if item not in seen_items][:top_k]
    
    return np.array(top_pop)


### Check your solution:
Run your function on the interaction matrix prepared before, make sure the input/output is correctly formatted.<br>
Get the <b>top 10</b> recommendations for <b>user 0</b>.
What are the tracks recommended to them? Would you like such a recommendation? Will <b>user 0</b> like it?

In [15]:
# TODO: YOUR IMPLEMENTATION
top_10 = recTopKPop(_interaction_matrix_test_lfm, 0, 10)

In [16]:
# Test your solution, assert will print a message if something is wrong, no message if everything is correct
assert type(top_10) == np.ndarray, "Output should be an array."
assert len(top_10) == 10, "Length is not right."
# these recommendations are correct for the lfm dataset, comment it out if you are using ml-1m
assert np.array_equal(top_10, np.array([ 42,  43,  51,  96, 105, 151,  12, 104,  68, 150])), "Wrong recommendations."
# these recommendations are correct for the ml-1m dataset, comment it out if you are using lfm
# assert np.array_equal(top_10, np.array([2789, 1178, 1192, 476, 585, 2502,  589, 1539, 1180, 108])), "Wrong recommendations."

## TASK 3/3: POP Recommender Country (2 points)
Use what you have learned in Task 2 and implement a new version of the POP recommender that recommends to each user the top_k unseen tracks most popular among users from the same country as that user.

The function needs to figure out the country of the target user (the one receiving recommendations), see what is popular in that country, and then recommend top items not seen by the user before.

Please note the additional parameter **users**, a DataFrame consisting of user data with a "country" column (think of the .user file in the dataset). Use it well.
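A minimal sketch of the country restriction on made-up data follows. It assumes that row i of the matrix belongs to row i of the user frame, which may not hold for arbitrary data; the country codes are placeholders.

```python
import numpy as np
import pandas as pd

# Toy interaction matrix: 4 users x 3 items, illustrative values.
toy = np.array([[1, 0, 0],
                [0, 1, 1],
                [0, 1, 0],
                [1, 0, 1]])
# Hypothetical user table aligned with the matrix rows.
toy_users = pd.DataFrame({"user_id": [0, 1, 2, 3],
                          "country": ["AT", "AT", "AT", "DE"]})
user = 0

# Boolean mask selecting all users from the target user's country.
same_country = (toy_users["country"] == toy_users.loc[user, "country"]).to_numpy()

# Popularity counted only over those users: [1, 2, 1].
local_popularity = toy[same_country].sum(axis=0)

# Rank by local popularity, then drop items the target user has seen.
ranking = np.argsort(-local_popularity, kind="stable")
top_2 = ranking[toy[user, ranking] == 0][:2]
print(top_2)
```

Counted over all users, the three items are equally popular here; restricting to the user's country is what puts item 1 first.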

In [46]:
def recTopKPopByCountry(inter_matr: np.ndarray,
                        user: int,
                        top_k: int,
                        users: pd.DataFrame) -> np.ndarray:
    '''
    inter_matr - np.ndarray, from the task 1;
    user - int, user_id;
    top_k - int, expected length of the resulting list;
    users: pandas DataFrame, consisting of user information for all users, requires a "country" column;

    returns - list/array of top K popular items that the user has never seen
              (sorted in the order of descending popularity);
    '''
    
    # Create a mapping from user_id to matrix index
    user_id_to_idx = {uid: idx for idx, uid in enumerate(users['user_id'].unique())}
    
    if user not in user_id_to_idx:
        raise ValueError("User ID not found in users DataFrame")
    
    user_idx = user_id_to_idx[user]
    user_country = users.loc[users['user_id'] == user, 'country'].values[0]
    
    # Get users in the same country
    country_users = users.loc[users['country'] == user_country, 'user_id'].values
    country_user_indices = [user_id_to_idx[uid] for uid in country_users if uid in user_id_to_idx]
    
    # Compute item popularity within the country
    country_interactions = inter_matr[country_user_indices, :]
    item_popularity = np.sum(country_interactions, axis=0)
    
    # Sort items by popularity in descending order; the stable sort breaks ties by lower item index
    sorted_items = np.argsort(-item_popularity, kind="stable")
    
    # Get items the user has already interacted with
    seen_items = set(np.where(inter_matr[user_idx, :] > 0)[0])
    
    # Filter out seen items and select the top K popular ones
    top_pop = [item for item in sorted_items if item not in seen_items][:top_k]
    
    return np.array(top_pop)


### Check your solution:
Run your function on the interaction matrix prepared before, make sure the input/output is correctly formatted.<br>
Get the <b>top 10</b> recommendations for <b>user 0</b>.
What are the tracks recommended to them? Would you like such a recommendation? Will <b>user 0</b> like it?

In [49]:
inter_matr_lfm = inter_matr_implicit(users_lfm, items_lfm, interactions_lfm, "lfm-tiny", threshold=1)
# create a pandas DataFrame with user data that has at least a "country" column
users = users_lfm
top_10 = recTopKPopByCountry(inter_matr=inter_matr_lfm, user=0, top_k=10, users=users)

In [51]:
# Test your solution, assert will print a message if something is wrong, no message if everything is correct
assert type(top_10) == np.ndarray, "Output should be an array."
assert len(top_10) == 10, "Length is not right."
assert np.array_equal(top_10, np.array([43, 42, 69, 30, 96, 33, 51, 11, 71, 65])), "Wrong recommendations."

AssertionError: Wrong recommendations.

### Final check
* Your functions are going to be tested in isolation, make sure you don't use global variables;
* Remove all the code you don't need, provide comments for the rest;
* Check the execution time of your functions. If any of them takes more than one minute to execute on the given data, try optimizing it. Extremely inefficient solutions will get score penalties;
* Don't forget to rename the notebook before submission;
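
One possible way to check the execution time is sketched below. The helper `timed` is a hypothetical name introduced for this example, and `np.sum` on a dummy array merely stands in for a call to one of your own functions.

```python
import time
import numpy as np

def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; replace with a call to the function you want to measure.
result, seconds = timed(np.sum, np.ones((1000, 1000)))
print(f"took {seconds:.4f} s")
```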


In [None]:
# Leave this cell the way it is, please.