
*UE Learning from User-generated Data, CP MMS, JKU Linz 2022*
# Exercise 1: Basics of Recommender Systems
In this exercise we familiarize ourselves with recommender systems and the kind of data they use, and implement a simple baseline recommendation algorithm.

The assignment submission deadline is 22.03.2022 12:00.
Please, don't forget to rename your Jupyter Notebook according to the convention:<br>

LUD22_ex01_k<font color='red'>\<Matr. Number\></font>_<font color='red'>\<Surname-Name\></font>.ipynb

for example:

LUD22_ex01_k000007_Bond-James.ipynb

## Introduction
* What are recommender systems?
* Where do we encounter them?
* What part does User-Generated Data play in RecSys?

## Recommendation Scenario
Imagine a platform where users consume items: buy goods (Amazon), listen to music tracks (Deezer, Spotify), watch movies (Netflix) or videos (YouTube).

At some point a user may face a choice: "what item should I have a look at next?" It may be that they don't know exactly what they need and cannot formulate a query. With catalogs of millions of items, they have little chance of finding something useful by browsing through all of them.

In such a situation, Recommender Systems are expected to make the decision easier for the user by narrowing the scope to a handful of individually selected options, for example the top 10 recommended songs.

Information the recommendation can be based on:
* Items already consumed by the user
* Items consumed by other users
* User relations
* Item meta-data & content
* ...

User-Item interactions are one of the most widely used signals in recommendation. Initially they may be available in the form of system logs (see the table below). There is a multitude of ways a user can interact with an item: consume (buy, watch, listen), note (like, save to favorites), share, and others. In this exercise we only deal with item consumption.

#### Example: Raw User-Item Interactions Data
| Meaningless but Unique<br>Event Id | User Id | Item Id | Event Type | Date |
| ---         |---  |--- |---   |   ---    |
| 002Ax4gf... | 12  | 2  | 6000 | 13.04.08 |
| 9f2D4jKx... | 908 | 2  | 6000 | 01.02.09 |
| 3g6lP89qs.. | 12  | 13 | 4800 | 11.10.10 |
| ...         | ... |... | ...  | ...      |

## LFM-2b Sample
Throughout the whole exercise track we will mostly be working on the music recommendation task. Note that all methods we consider are applicable to other domains!

[LFM-2b](http://www.cp.jku.at/datasets/LFM-2b/) is a large dataset of over two billion listening events, spanning ~15 years, crawled from the LastFM platform. It is accompanied by user demographic information and music track meta-data. In this exercise we take a look at a small sample of the aggregated dataset. It consists of three files:

* 'sampled_1000_items_inter.txt' - data about user-track interactions;
* 'sampled_1000_items_tracks.txt' - track-related information;
* 'sampled_1000_items_demo.txt' - user-related information;

'sampled_1000_items_inter.txt'<br>
Contains the cumulative number of listening events per User-Track pair over the whole period.
    
| User Id | Track Id | Number of Interactions | 
| ---    |   ---  |   ---  |
| 0 | 0 | 3  |
| 0 | 6 | 5 |
| 2 | 17 | 8 |
| ... | ... | ... |
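
As a rough illustration of how such cumulative counts relate to raw logs, the sketch below aggregates a tiny, made-up event log into per-pair interaction counts. The column names (`event_id`, `user_id`, `item_id`, `n_interactions`) and the events themselves are assumptions for this example and are not taken from LFM-2b.

```python
import pandas as pd

# Hypothetical raw event log: one row per listening event (made-up data)
raw_events = pd.DataFrame({
    "event_id": ["a1", "b2", "c3", "d4"],
    "user_id": [12, 908, 12, 12],
    "item_id": [2, 2, 13, 2],
})

# Count the events for every (user, item) pair
counts = (raw_events
          .groupby(["user_id", "item_id"])
          .size()
          .reset_index(name="n_interactions"))

print(counts)
```

Each row of `counts` then plays the role of one line in the interactions file: user 12 played item 2 twice, so that pair gets a count of 2.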

'sampled_1000_items_tracks.txt'<br>
Track-related information (line index, starting from zero, is the **Track ID**):

| Artist | Track Name |
| ---    |   ---  |
| Helstar | Harsh Reality |
| Carpathian Forest | Dypfryst / Dette Er Mitt Helvete |
| Cantique Lépreux | Tourments Des Limbes Glacials |
| ... | ... |

'sampled_1000_items_demo.txt'<br>
User-related information (line index, starting from zero, is the **User ID**):

| Location | Age | Gender | Reg. Date |
|   ---  |   ---  |   ---  |   ---  |
| BR | 25  | m | 2007-10-12 18:42:00 |
| UK | 27 | m | 2006-11-17 16:51:56 |
| US | 32 | m | 2010-02-02 22:30:15 |
| ... | ... | ... | ... |

All files are in .tsv (tab '**\t**' separated values) format.
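
A minimal sketch of reading such a file with pandas is shown below. It uses an in-memory string instead of the real files and assumes that the files have no header row; the column names passed via `names` are hypothetical labels chosen for readability.

```python
import io
import pandas as pd

# Toy stand-in for an interactions file: tab-separated, no header row (assumed)
toy_tsv = "0\t0\t3\n0\t6\t5\n2\t17\t8\n"

# sep="\t" selects tab-separated parsing, header=None keeps the first line as data
inter = pd.read_csv(io.StringIO(toy_tsv), sep="\t", header=None,
                    names=["user_id", "track_id", "n_interactions"])

print(inter)
```

For the real data, the `io.StringIO(...)` argument would be replaced by the file path.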

## <font color='red'>TASKS</font>:

Implement functions specified below. Please, don't change the signatures (names, parameters) and follow the specifications closely. Your implementation should not require any additional imports, apart from those already in the notebook.

For testing purposes, make sure the three data files mentioned above are placed in the same folder as the .ipynb file.

In [1]:
import pandas as pd
import numpy as np

## TASK 1: Interaction Matrix (4 points)
The interaction matrix is a common data structure used in some (not all) recommender algorithms. It is a matrix with dimensions [number of users] × [number of items] known to the system. Every element of the matrix shows whether the given User has ever interacted with the given Item. This can be encoded in a binary manner, as a probability, or as a rating given by the User to the Item.

**Write a function** that receives three file names as input and returns a 2-dimensional numpy array with the corresponding interaction matrix, where **0** means the user either did not interact with the track intentionally or did not like it (played the track fewer than [threshold] times), and **1** means the user listened to the track at least [threshold] times.

The first dimension of the matrix should correspond to users, the second to items.

**Important note:** we introduce the threshold as a way to filter out accidental playbacks. Even if a user played a track only once, the listening event is still reflected in the LFM-2b dataset. A usual threshold value is about 2.
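
To make the effect of the threshold concrete, here is a small sketch on made-up data. The helper `toy_inter_matrix` is a hypothetical name used only for this illustration; it takes the interactions as a list of (user, item, count) triples rather than file paths, so it is not the required `inter_matr_binary`.

```python
import numpy as np

def toy_inter_matrix(interactions, n_users, n_items, threshold=1):
    """Build a binary user-item matrix from (user, item, count) triples."""
    inter = np.asarray(interactions)
    matrix = np.zeros((n_users, n_items), dtype=int)
    # Keep only the pairs whose play count reaches the threshold
    valid = inter[inter[:, 2] >= threshold]
    # Integer-array indexing marks all remaining pairs at once
    matrix[valid[:, 0], valid[:, 1]] = 1
    return matrix

# Made-up interactions: user 0 played item 2 only once
toy = [(0, 0, 3), (0, 2, 1), (1, 1, 5), (2, 2, 2)]
m = toy_inter_matrix(toy, n_users=3, n_items=3, threshold=2)
print(m)
```

With `threshold=2` the single playback of item 2 by user 0 is treated as accidental and stays 0, while all other pairs become 1.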

Insert your solution into the signature below. Please, don't change the name or the argument set, even if they are not pretty.

In [2]:
def inter_matr_binary(usr_path = 'sampled_1000_items_demo.txt',
                      itm_path = 'sampled_1000_items_tracks.txt',
                      inter_path = 'sampled_1000_items_inter.txt',
                      threshold = 1)  -> np.ndarray:
    '''
    usr_path - string path to the file with users data;
    itm_path - string path to the file with item data;
    inter_path - string path to the file with interaction data;
    threshold - int > 0, minimum number of plays for an interaction to count as valid
    
    returns - 2D np.array, rows - users, columns - items;
    '''
    
    # Read files
    usr = pd.read_csv(usr_path, sep="\t", header=None).values
    itm = pd.read_csv(itm_path, sep="\t", header=None).values
    inter = pd.read_csv(inter_path, sep="\t", header=None).values
    
    # Create interaction matrix
    res = np.zeros(shape=(len(usr), len(itm)))
    for interaction in inter:
        res[interaction[0], interaction[1]] = int(interaction[2] >= threshold)
    
    return res

### Check your solution:
Run your function on the data discussed above and make sure that the result is correct.

In [3]:
# TODO: YOUR IMPLEMENTATION
_interaction_matrix_test = inter_matr_binary()

In [4]:
assert _interaction_matrix_test is not None, "Interaction Matrix should not be None!"
assert type(_interaction_matrix_test) == np.ndarray, "Interaction Matrix should be a numpy array!"
assert _interaction_matrix_test.shape == (1194, 412), "Shape of Interaction Matrix is wrong!"
assert np.array_equal(np.unique(_interaction_matrix_test), [0, 1]), "Interaction Matrix should only contain 0 and 1!"

## TASK 2/2: POP Recommender (4 points)
One of the most straightforward approaches to recommendation is recommending the most popular items to every user. We call such a recommender POP. It is a useful baseline for creating more sophisticated systems and can serve as a default recommender when there is no data available to build the recommendation upon (for example, if the user has just joined the platform and hasn't interacted with anything yet). Throughout the whole exercise track we only recommend items not seen by the user before (repeated consumption is out of our scope).

**Write a function** that recommends [K] most popular items to a given user, **making sure that the user hasn't seen any of the recommended items before.**

The function should take three arguments: an np.array of arbitrary dimensions (supporting any number of users and items) in the format from task 1 (interaction matrix), a user ID (int) and K (int > 0).
Expected return: a list or a 1D array of [K] IDs of the most popular items (sorted in order of descending popularity) not seen by the user.
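
The idea can be sketched on a toy matrix as follows. The function name `toy_pop_recommend` and the data are made up for this illustration. Note that this sketch breaks popularity ties in favour of the lower item ID and returns fewer than K items when the user has fewer than K unseen items; both are assumptions of the sketch, so its ordering may differ from the one expected by the checks in this notebook.

```python
import numpy as np

def toy_pop_recommend(inter_matr, user, top_k):
    """Return up to top_k unseen item IDs, most popular first."""
    # Popularity = number of users who interacted with each item
    popularity = inter_matr.sum(axis=0).astype(float)
    # Push items the user has already seen to the very end of the ranking
    seen = inter_matr[user] != 0
    popularity[seen] = -np.inf
    # Stable sort on negated popularity: descending order, ties by lower ID
    ranked = np.argsort(-popularity, kind="stable")
    n_unseen = int((~seen).sum())
    return ranked[:min(top_k, n_unseen)]

toy = np.array([[1, 0, 0, 0],
                [1, 1, 0, 1],
                [0, 1, 0, 1],
                [0, 1, 1, 0]])

print(toy_pop_recommend(toy, user=0, top_k=2))
```

Here the item popularities are 2, 3, 1 and 2; user 0 has already seen item 0, so the two recommendations are items 1 and 3.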

Insert your solution into the signature below. Please, don't change the name or the argument set, even if they are not beautiful.

In [5]:
def recTopKPop(inter_matr: np.array,
               user: int,
               top_k: int) -> np.array:
    '''
    inter_matr - np.array from the task 1;
    user - user_id, integer;
    top_k - expected length of the resulting list;
    
    returns - list/array of top K popular items that the user has never seen
              (sorted in the order of descending popularity);
    '''
    
    # Calculate popularity of items
    item_pop = np.zeros(shape=(inter_matr.shape[1],))
    for item_id in range(len(item_pop)):
        
        # if user has already listened to specific track, leave its popularity at 0
        if inter_matr[user, item_id] != 0:
            continue
        
        item_pop[item_id] = inter_matr[:, item_id].sum()
    
    # Return indices of the top_k most popular unseen tracks, most popular first
    top_pop = np.argsort(item_pop)[:-top_k-1:-1]
                                     
    return top_pop

### Check your solution:
Run your function on the interaction matrix prepared before, make sure the input/output is correctly formatted.<br>
Get the <b>top 10</b> recommendations for <b>user 0</b>.
What are the tracks recommended to them? Would you like such a recommendation? Will <b>user 0</b> like it?

In [6]:
# TODO: YOUR IMPLEMENTATION
top_10 = recTopKPop(inter_matr_binary(), 0, 10)

In [7]:
assert type(top_10) == np.ndarray, "Output should be an array."
assert len(top_10) == 10, "Length is not right."
assert np.array_equal(top_10, np.array([42, 43, 51, 96, 105, 151, 12, 104, 68, 150])), "Wrong recommendations."

### Final check
* Your functions are going to be tested in isolation, so make sure you don't use global variables;
* Remove all the code you don't need, provide comments for the rest;
* Check the execution time of your functions. If any of them takes more than one minute to execute on the given data, try optimizing it. Extremely inefficient solutions will get score penalties;
* Don't forget to rename the notebook before submission;
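
One possible way to check the execution time is sketched below. The helper `timed` is a hypothetical name, and `np.zeros` only stands in for the function under test.

```python
import time
import numpy as np

def timed(func, *args, **kwargs):
    """Call func and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; replace np.zeros with the function you want to measure
result, seconds = timed(np.zeros, (1000, 1000))
print(f"took {seconds:.4f} s")
```

In a notebook, the `%%time` cell magic gives a similar measurement without extra code.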


In [8]:
# Leave this cell the way it is, please.