<a href="https://colab.research.google.com/github/nmarkin/Rec-Sys-Okko/blob/main/baseline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 0. Configuration

In [28]:
# links to shared data MovieLens
# source on kaggle: https://www.kaggle.com/code/quangnhatbui/movie-recommender/data
RATINGS_SMALL_URL = 'https://drive.google.com/file/d/1BlZfCLLs5A13tbNSJZ1GPkHLWQOnPlE4/view?usp=share_link'
MOVIES_METADATA_URL = 'https://drive.google.com/file/d/19g6-apYbZb5D-wRj4L7aYKhxS-fDM4Fb/view?usp=share_link'

# 1. Modules and functions

In [29]:
import numpy as np
import pandas as pd

from itertools import islice, cycle, product

import warnings
warnings.filterwarnings('ignore')

## 1. 1. Helper functions to avoid copy paste

In [30]:
def read_csv_from_gdrive(url):
    """
    gets csv data from a given url (taken from file -> share -> copy link)
    :url: example https://drive.google.com/file/d/1BlZfCLLs5A13tbNSJZ1GPkHLWQOnPlE4/view?usp=share_link
    """
    file_id = url.split('/')[-2]
    file_path = 'https://drive.google.com/uc?export=download&id=' + file_id
    data = pd.read_csv(file_path)

    return data

In [31]:
def compute_popularity(df: pd.DataFrame, item_id: str, max_candidates: int):
    """
    calculates mean rating to define popular titles
    """
    popular_titles = df.groupby(item_id).agg({'rating': np.mean})\
                     .sort_values(['rating'], ascending=False).head(max_candidates).index.values

    return popular_titles

# 2. Data

## 2. 1. Load data

`interactions` dataset shows list of movies that users watched, along with given ratings:

In [32]:
# interactions data
interactions = read_csv_from_gdrive(RATINGS_SMALL_URL)
interactions.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


`movies_metadata` dataset shows the list of movies existing on OKKO platform:

In [33]:
# information about films etc
movies_metadata = read_csv_from_gdrive(MOVIES_METADATA_URL)
movies_metadata.head(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0


## 2.2 Data preparation

The objective of this step is to identify movies from two datasets that were watched by users.

In [34]:
# align data in both dataframes to merge
interactions['movieId'] = interactions['movieId'].astype(str)
movies_metadata.rename(columns = {'id': 'movieId'}, inplace = True)

In [35]:
# leave only those films that intersect with each other
interactions_filtered = interactions.loc[interactions['movieId'].isin(movies_metadata['movieId'])]
print(interactions.shape, interactions_filtered.shape)

(100004, 4) (44989, 4)


*-- explain why we need a mapper?*

In [36]:
# crate mapper for movieId and title names
item_name_mapper = dict(zip(movies_metadata['movieId'], movies_metadata['original_title']))

In [37]:
# create users input
users = interactions[['userId']].drop_duplicates().reset_index(drop = True)

# 3. Model

Let's define our baseline popularity recommender BaselineRecommender - top rated titles based on average rating with possibility to get by any group(s)

The pipeline will be similar to most python ML modules -- it will have two methods in the end: fit() and recommend()
1. The logic of fit() as follow:
- Initiate recommendation based on median rating from all observations recomm_common;
- Prepare list of interacted items by users
- If we set groups - we get recommendations i.e. calculate movie ratings by groups:
    - If we get NaN, we fill with base recommendations 
    - If we get less than required number of candidates, we populate from base recommendations

2. The logic of recommend():
- Return base recommendations if users data is not set;
- In case of category wise requirement -- we get results of our fit

## 3.1. Fit

In [38]:
# first, we define how many candidates we want to get
MAX_CANDIDATES = 20
ITEM_COLUMN = 'movieId'
USER_COLUMN = 'userId'

In [39]:
# then, we extract top 20 movies by aggregating movies and averaging rating column across all users
base_recommendations = compute_popularity(interactions_filtered, ITEM_COLUMN, MAX_CANDIDATES)
base_recommendations

array(['74727', '128846', '702', '127728', '65216', '43267', '8675',
       '80717', '86817', '8699', '872', '27724', '26791', '876', '64278',
       '301', '59392', '3021', '3112', '1933'], dtype=object)

Thus, we got 20 films with highest average rating

Now, as we discussed earlier, in movies recommendations there is no need to recommend the same film which user has already watched. Let's implement it as well

In [40]:
# we get all interacted items for each user and save it in dictionary {'userId': [items list]}
known_items = interactions_filtered.groupby(USER_COLUMN)[ITEM_COLUMN].apply(list).to_dict()
len(known_items)


671

In [41]:
# let's check it for one userId = 1
known_items[1]

['1371', '1405', '2105', '2193', '2294', '2455']

Now we have all necessary components: base recommendations without groups with possibility to filter already watched items

Also, if we want to get recommendations based on some user groups we can easily do the same with groupby() method and same approach

In [42]:
# lets add artifical binary group to check BaselineRecommender
group = [np.random.random_integers(2) for x in range(len(users))]
users['group'] = group

In [43]:
data = pd.merge(interactions_filtered, users, how='left', on = USER_COLUMN)
group_recommendations = data.groupby('group').apply(compute_popularity, ITEM_COLUMN, MAX_CANDIDATES)
group_recommendations.head()

group
1    [178, 128846, 1543, 64278, 127728, 64499, 6450...
2    [114464, 876, 6163, 565, 935, 166, 392, 3112, ...
dtype: object

In the output we have two rows with a list of film ids for each binary group 

Next, we have to implement recommned() method which will use 

## 3. 2. Recommend

In [44]:
# if we do not have groups, then it means we give the same recommendations for all users i.e. base_recommendations
recs = list(islice(cycle([base_recommendations]), len(users['userId'])))
users['rekkos'] = recs
users.head()

Unnamed: 0,userId,group,rekkos
0,1,2,"[74727, 128846, 702, 127728, 65216, 43267, 867..."
1,2,2,"[74727, 128846, 702, 127728, 65216, 43267, 867..."
2,3,1,"[74727, 128846, 702, 127728, 65216, 43267, 867..."
3,4,2,"[74727, 128846, 702, 127728, 65216, 43267, 867..."
4,5,2,"[74727, 128846, 702, 127728, 65216, 43267, 867..."


In [45]:
# and let's have an example with groups we created earlier
group_recommendations = group_recommendations.reset_index()
group_rekkos = pd.merge(users, group_recommendations, how = 'left', on = 'group')
group_rekkos.rename(columns = {0: 'rekkos'}, inplace = True)
group_rekkos.head()

Unnamed: 0,userId,group,rekkos,rekkos.1
0,1,2,"[74727, 128846, 702, 127728, 65216, 43267, 867...","[114464, 876, 6163, 565, 935, 166, 392, 3112, ..."
1,2,2,"[74727, 128846, 702, 127728, 65216, 43267, 867...","[114464, 876, 6163, 565, 935, 166, 392, 3112, ..."
2,3,1,"[74727, 128846, 702, 127728, 65216, 43267, 867...","[178, 128846, 1543, 64278, 127728, 64499, 6450..."
3,4,2,"[74727, 128846, 702, 127728, 65216, 43267, 867...","[114464, 876, 6163, 565, 935, 166, 392, 3112, ..."
4,5,2,"[74727, 128846, 702, 127728, 65216, 43267, 867...","[114464, 876, 6163, 565, 935, 166, 392, 3112, ..."


We got our groupwise recommendations from 3.1. part and just joined them by group of users are assigned to

## 3.3. Wrap everything into pretty functions

### 3.3.1 Fit part

In [46]:
def fit(
    data: pd.DataFrame,
    item_col: str, groups: list = None,
    max_candidates: int = 20
    ):
    """
    function runs all pipeline to generate recommendations based on given group
    :data: dataframe of interactions
    :item_col: item column name
    :groups: optional, list of groups column names to get recommendations
    :max_candidates: number of recommendations to return
    """
    
    if groups is not None:
        recommendations = data.groupby(groups).apply(compute_popularity, item_col, max_candidates)
    else:
        recommendations = compute_popularity(data, item_col, max_candidates)

    return recommendations

In [47]:
# check base
fit(data, item_col=ITEM_COLUMN)

array(['74727', '128846', '702', '127728', '65216', '43267', '8675',
       '80717', '86817', '8699', '872', '27724', '26791', '876', '64278',
       '301', '59392', '3021', '3112', '1933'], dtype=object)

In [48]:
# check group-wise
fit(data, item_col=ITEM_COLUMN, groups=['group'])

group
1    [178, 128846, 1543, 64278, 127728, 64499, 6450...
2    [114464, 876, 6163, 565, 935, 166, 392, 3112, ...
dtype: object

### 3.3.2 Recommend part

In [78]:
def recommend(
    users: pd.DataFrame,
    recommendations: pd.DataFrame,
    groups: list = None,
    K: int = 10):
    """
    recommends items for a given list of users
    :users: series / list of users to recommend
    :recommendations: output of fit() function
    :groups: optional, list of groups column names to get recommendations
    :K: number of items to recommend (not always we want to show dozens of items instantly)
    """
    if groups is not None:
        output = pd.merge(users, recommendations.reset_index(), how = 'left', on = 'group')

    else:
        output = users.copy(deep = True)
        recs = list(islice(cycle([recommendations]), len(users['userId'])))
        output['rekkos'] = recs

    return output


In [50]:
# check
recs = fit(data, item_col=ITEM_COLUMN)
check_recs = recommend(users[['userId', 'group']], recs)
check_recs.head()

Unnamed: 0,userId,group,rekkos
0,1,2,"[74727, 128846, 702, 127728, 65216, 43267, 867..."
1,2,2,"[74727, 128846, 702, 127728, 65216, 43267, 867..."
2,3,1,"[74727, 128846, 702, 127728, 65216, 43267, 867..."
3,4,2,"[74727, 128846, 702, 127728, 65216, 43267, 867..."
4,5,2,"[74727, 128846, 702, 127728, 65216, 43267, 867..."


In [81]:
# check group-wise
recs = fit(data, item_col=ITEM_COLUMN, groups = ['group'])
check_recs = recommend(users[['userId', 'group']], recs, ['group'])
check_recs.columns

Index(['userId', 'group', 0], dtype='object')

Congrats! Your first basic recommender system is ready!!

# TODO
- Add filtration of watched items to pipeline
- Also, consider cases when you fitler watched ones and you have less items in recommendations than required i.e. number of recommendations < MAX_CANDIDATES

In [52]:
# known_items - dict(user_id, [movies_id_1, ...])

In [64]:
len(max(known_items.values(), key=len))

896

In [53]:
def fit(
    data: pd.DataFrame,
    item_col: str, groups: list = None,
    max_candidates: int = 20 + len(max(known_items.values(), key=len))
    ):
    """
    function runs all pipeline to generate recommendations based on given group
    :data: dataframe of interactions
    :item_col: item column name
    :groups: optional, list of groups column names to get recommendations
    :max_candidates: number of recommendations to return
    """
    
    if groups is not None:
        recommendations = data.groupby(groups).apply(compute_popularity, item_col, max_candidates)
    else:
        recommendations = compute_popularity(data, item_col, max_candidates)

    return recommendations

In [111]:
def recommend(
    users: pd.DataFrame,
    recommendations: pd.DataFrame,
    groups: list = None,
    seen: dict = None,
    K: int = 20):
    """
    recommends items for a given list of users
    :users: series / list of users to recommend
    :recommendations: output of fit() function
    :groups: optional, list of groups column names to get recommendations
    :K: number of items to recommend (not always we want to show dozens of items instantly)
    """
    if groups is not None:
        output = pd.merge(users, recommendations.reset_index(), how = 'left', on = 'group')
        output = output.rename(columns={0: "rekkos"})

    else:
        output = users.copy(deep = True)
        recs = list(islice(cycle([recommendations]), len(users['userId'])))
        output['rekkos'] = recs
        
    if seen is not None:
        output['rekkos'] = output['rekkos'].apply(set)
        output['seen'] = output['userId'].apply(lambda x: set(seen.get(x, -1)))
        output['rekkos'] = output['rekkos'] - output['seen']
        output['rekkos'] = output['rekkos'].apply(list)
        output = output.drop(columns=['seen'])
    
    output['rekkos'] = output['rekkos'].apply(lambda x: x[:K])

    return output


In [112]:
# check
recs = fit(data, item_col=ITEM_COLUMN)
check_recs = recommend(users[['userId', 'group']], recs, seen=known_items)
check_recs.head()

Unnamed: 0,userId,group,rekkos
0,1,2,"[3021, 8675, 872, 43267, 876, 74727, 59392, 26..."
1,2,2,"[3021, 8675, 872, 43267, 876, 74727, 59392, 26..."
2,3,1,"[3021, 8675, 872, 43267, 876, 74727, 59392, 26..."
3,4,2,"[3021, 8675, 872, 43267, 876, 74727, 59392, 26..."
4,5,2,"[3021, 8675, 872, 43267, 876, 74727, 59392, 26..."


In [113]:
len(check_recs['rekkos'][0])

20

In [114]:
# check group-wise
recs = fit(data, item_col=ITEM_COLUMN, groups = ['group'])
check_recs = recommend(users[['userId', 'group']], recs, ['group'])
check_recs.head()

Unnamed: 0,userId,group,rekkos
0,1,2,"[114464, 876, 6163, 565, 935, 166, 392, 3112, ..."
1,2,2,"[114464, 876, 6163, 565, 935, 166, 392, 3112, ..."
2,3,1,"[178, 128846, 1543, 64278, 127728, 64499, 6450..."
3,4,2,"[114464, 876, 6163, 565, 935, 166, 392, 3112, ..."
4,5,2,"[114464, 876, 6163, 565, 935, 166, 392, 3112, ..."


# So, What is Next?

Well, in this section we discussed how basic recommendations based on heuristic can be done
- We took top-rated films and recommended to users
- Added filter to remove already watched films
- Wrapped all steps into functions


In the next chapter we will talk about a bit relatively more advanced techniques like Content-Based / Collaborative Filtering