# Collaborative filtering

A common problem is that, rather than predicting an outcome, we want to recommend items a user may like, based on prior individual-level data. A standard approach to this problem is called _collaborative filtering_.

For movie recommendation, say, the idea is to find users whose preferences are similar to yours, and then recommend movies that those users liked and that you haven't seen.
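
As a rough illustration of that idea, here is a minimal sketch on a toy ratings matrix. The data, the helper `cosine_on_shared`, and the choice of cosine similarity are all assumptions made for the example; real systems usually centre ratings and combine several neighbours rather than using just one.

```python
import numpy as np

# Toy ratings matrix (hypothetical data): rows are users, columns are movies,
# and np.nan marks a movie the user has not rated.
ratings_toy = np.array([
    [5.0, 4.0, np.nan, 1.0],   # user 0: the user we recommend for
    [5.0, 5.0, 4.0,    1.0],   # user 1: similar tastes to user 0
    [1.0, 2.0, 1.0,    5.0],   # user 2: roughly opposite tastes
])

def cosine_on_shared(a, b):
    """Cosine similarity using only the movies both users have rated."""
    shared = ~np.isnan(a) & ~np.isnan(b)
    if not shared.any():
        return 0.0
    a, b = a[shared], b[shared]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = 0
others = [u for u in range(len(ratings_toy)) if u != target]
sims = {u: cosine_on_shared(ratings_toy[target], ratings_toy[u]) for u in others}
nearest = max(sims, key=sims.get)   # the most similar other user

# Among movies the target has not rated, pick the one the nearest user rated highest.
unseen = np.isnan(ratings_toy[target])
candidate_scores = np.where(unseen, ratings_toy[nearest], -np.inf)
recommended_movie = int(np.argmax(candidate_scores))
```

Here user 1 is the closest match to user 0, so the only movie user 0 has not rated (column 2) is recommended on the strength of user 1's rating.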

Items, in a more general sense, can include links that you click on, diagnoses that are selected for patients, and so on.

The assumption is that some underlying set of features, not necessarily labelled, determines the association between users and items. The objective is to uncover these _latent factors_.
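
One common way to make this concrete (an assumption here, not something stated above) is to describe each user and each item by a short vector of factor values and to score a user-item pair by their dot product. The numbers and the factor meanings in the comments below are invented purely for illustration; learned factors come without labels.

```python
import numpy as np

# Hypothetical latent factors. The column meanings are illustrative only:
#                        [sci-fi, action, older films]
user_factors = np.array([0.9,    0.8,   -0.6])
movie_a      = np.array([0.98,   0.9,   -0.9])   # e.g. a recent sci-fi action film
movie_b      = np.array([-0.99, -0.3,    0.8])   # e.g. an older drama

# A higher dot product means a closer match between user and movie.
score_a = user_factors @ movie_a
score_b = user_factors @ movie_b
```

With these made-up values `score_a` comes out well above `score_b`, so this user would be matched with the first movie rather than the second.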

In [2]:
from fastai.collab import *
from fastai.tabular.all import *

## MovieLens dataset

In [3]:
path = untar_data(URLs.ML_100k)

In [4]:
path

Path('/home/jupyter/.fastai/data/ml-100k')

In [5]:
Path.BASE_PATH = path

In [6]:
path.ls()

(#23) [Path('ub.test'),Path('u3.base'),Path('ub.base'),Path('u4.base'),Path('u4.test'),Path('u1.base'),Path('ua.base'),Path('u.occupation'),Path('README'),Path('u.data')...]

The main table is `u.data`.

In [8]:
ratings = pd.read_csv(path/'u.data', delimiter='\t', 
                      names=['user', 'movie', 'rating', 'timestamp'])

In [9]:
ratings.head()

| | user | movie | rating | timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |

In [10]:
ratings.shape

(100000, 4)

In [14]:
ratings.rating.unique()

array([3, 1, 2, 4, 5])

In [15]:
len(ratings.movie.unique())

1682

In [16]:
len(ratings.user.unique())

943
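
These counts show how sparse the data is. The sketch below reuses the three numbers printed above and assumes each user rated each movie at most once, so that every rating fills a distinct cell of the user-by-movie table.

```python
n_ratings, n_users, n_movies = 100_000, 943, 1682   # values printed above

n_cells = n_users * n_movies     # size of the full user-by-movie table
density = n_ratings / n_cells    # fraction of cells that hold a rating
print(f"{n_cells:,} cells, {density:.1%} filled")
# prints: 1,586,126 cells, 6.3% filled
```

Roughly 94% of the table is therefore empty.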

In [72]:
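# The 10 movies with the most ratings.
top_movies = (ratings[['movie', 'user']]
              .groupby('movie')
              .count()
              .sort_values(by='user', ascending=False)[:10]
              .index
              .values
              .ravel()
             )
# Users who rated at least one of those movies. After the groupby, 'user' is
# the index, so sort_values(by='user') orders by user ID: this keeps the
# 10 highest user IDs, not the 10 most active users.
top_users = (ratings[ratings['movie'].isin(top_movies)]
              .groupby('user')
              .count()
              .sort_values(by='user', ascending=False)[:10]
              .index
              .values
              .ravel()
            )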

movie_slice = (ratings[['user', 'movie', 'rating']]
 .pivot(index='user', columns='movie')
 .loc[top_users, (slice(None), top_movies)]
)

In [74]:
movie_slice.style.background_gradient(cmap='RdYlGn', vmin=1, vmax=5)

| user ↓ / movie → | 1 | 50 | 100 | 121 | 181 | 258 | 286 | 288 | 294 | 300 |
|---|---|---|---|---|---|---|---|---|---|---|
| 943 | | 4.0 | 5.0 | 3.0 | 4.0 | | | | | |
| 942 | | 5.0 | | | | 4.0 | | | | 5.0 |
| 941 | 5.0 | | | | 5.0 | 4.0 | | | 4.0 | 4.0 |
| 940 | | 4.0 | 3.0 | | 3.0 | 5.0 | 3.0 | | 4.0 | 5.0 |
| 939 | | | | 5.0 | | 4.0 | | | | |
| 938 | 4.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 3.0 | 5.0 | | 3.0 |
| 937 | | 5.0 | 3.0 | | | 4.0 | 4.0 | | 1.0 | 4.0 |
| 936 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 3.0 | 5.0 | | 3.0 | 3.0 |
| 935 | 3.0 | | 3.0 | 4.0 | 4.0 | | 5.0 | | | 4.0 |
| 934 | 2.0 | 5.0 | 4.0 | 3.0 | 4.0 | | 4.0 | | | |


The objective of collaborative filtering is to fill in the blanks of this table and then recommend, for each user, the unseen movie they are most likely to enjoy.
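
Once some model has produced a predicted rating for every cell, choosing a recommendation is a small masking step. The sketch below uses two hypothetical users and made-up predictions; `observed`, `predicted` and the `movie_*` labels are names invented for the example, and no particular model is implied.

```python
import numpy as np
import pandas as pd

# Observed ratings for two users (hypothetical values); NaN means "not rated yet".
observed = pd.DataFrame(
    {"movie_1": [4.0, np.nan], "movie_50": [np.nan, 5.0], "movie_100": [np.nan, np.nan]},
    index=pd.Index(["user_a", "user_b"], name="user"),
)

# Predicted ratings for every cell, as some model might produce them (made-up numbers).
predicted = pd.DataFrame(
    {"movie_1": [4.1, 3.0], "movie_50": [4.6, 4.8], "movie_100": [3.2, 4.4]},
    index=observed.index,
)

# Keep predictions only where no rating exists, then take the best one per user.
unseen_scores = predicted.where(observed.isna())
recommendation = unseen_scores.idxmax(axis=1)
```

For this toy input, `recommendation` maps `user_a` to `movie_50` and `user_b` to `movie_100`: movies a user has already rated are masked out before the maximum is taken.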