# Matrix completion via recommendation system example

This example applies matrix completion to a recommendation system built on the [360K Last.fm dataset](http://ocelma.net/MusicRecommendationDataset/lastfm-360K.html), which records how often each user played each artist.

In [1]:
%pip install -U implicit h5py

Collecting implicit
  Downloading implicit-0.7.2-cp311-cp311-win_amd64.whl.metadata (6.3 kB)
Collecting h5py
  Downloading h5py-3.15.1-cp311-cp311-win_amd64.whl.metadata (3.1 kB)
Downloading implicit-0.7.2-cp311-cp311-win_amd64.whl (750 kB)
Downloading h5py-3.15.1-cp311-cp311-win_amd64.whl (2.9 MB)
Installing collected packages: h5py, implicit
  Attempting uninstall: h5py
    Found existing installation: h5py 3.13.0
    Uninstalling h5py-3.13.0:
      Successfully uninstalled h5py-3.13.0

In [2]:
# retrieving last.fm dataset
from implicit.datasets.lastfm import get_lastfm
import numpy as np
import pandas as pd
from scipy import sparse
import os
from pathlib import Path



## Downloading and saving the Last.fm dataset

In [3]:
filepath = r'datasets/'
Path(filepath).mkdir(exist_ok=True)

if not os.path.exists(filepath + r'artist_user_plays.npz'):
    # save our dataset in sparse format
    artists, users, artist_user_plays = get_lastfm()

    sparse.save_npz(filepath + r'artist_user_plays.npz', artist_user_plays)
    np.save(filepath + 'artists.npy', artists)
    np.save(filepath + 'users.npy', users)
else:
    # load our dataset into original format
    artist_user_plays = sparse.load_npz(filepath + r'artist_user_plays.npz')
    artists = np.load(filepath + 'artists.npy', allow_pickle=True)
    users = np.load(filepath + 'users.npy', allow_pickle=True)

184MB [00:12, 15.1MB/s]                              


In [4]:
# investigate the content of the downloaded dataset
artists[np.random.randint(size=50, low=0, high=len(artists))]

array(['apollo lab', 'dr. drakken', 'roots of rebellion',
       'raoul de godewarsvelde', 'maio & co.', 'bhakthi maala',
       'orchestra barocca zefiro - alfredo bernardini',
       'absolutely perfect', 'lechner, anja', 'speedball',
       'i am sanctuary', 'amanda wilkinson', 'wojtek godzisz',
       'uncle jamms army', 'steve sharples', 'the streamers', 'agog',
       'dj benzi & lil wayne', 'piotr bukartyk', 'inveracity',
       't. griffin', 'peter iljitsch tschaikowsky', 'toothfairy',
       'nina puslar', 'the cracow klezmer band', 'maysa matarazzo',
       'sadri alışık', 'renee sandstrom', "harmonia & eno '76",
       'discípulos de dionisos', 'plus instruments', 'vivian girls',
       'béla fleck & chick corea', 'dj piccolo', 'sergio franchi',
       'bonnevill', 'massive attack & mad professor', 'the art of voice',
       'mezzanine owls', 'neoangin', 'kelly bell band',
       'acido criollo trio', 'ППК', 'oldboy ost', 'burkhard dallwitz',
       'the animals', 'simentera

In [5]:
users[0:50]

array(['00000c289a1829a808ac09c00daf10bc3c4e223b',
       '00001411dc427966b17297bf4d69e7e193135d89',
       '00004d2ac9316e22dc007ab2243d6fcb239e707d',
       '000063d3fe1cf2ba248b9e3c3f0334845a27a6bf',
       '00007a47085b9aab8af55f52ec8846ac479ac4fe',
       '0000c176103e538d5c9828e695fed4f7ae42dd01',
       '0000ee7dd906373efa37f4e1185bfe1e3f8695ae',
       '0000ef373bbd0d89ce796abae961f2705e8c1faf',
       '0000f687d4fe9c1ed49620fbc5ed5b0d7798ea20',
       '0001399387da41d557219578fb08b12afa25ab67',
       '000163263d2a41a3966a3746855b8b75b7d7aa83',
       '0001a57568309b287363e72dc682e9a170ba6dc2',
       '0001a88a7092846abb1b70dbcced05f914976371',
       '0001bd96207f323b53652bf400702719ad456d3c',
       '000215d3060a5b0ab7b3c415d49ec579100d4c87',
       '00024b5b85c40f990c28644d53257819980bf6bb',
       '00026e8fc41980c9605eac741cd97b8216d2dbbd',
       '000294c1f0d9b40067487457ca31f0caab81d44a',
       '00029d80b8af94f2d5e3349ceb28b7304f80c1c4',
       '0002dd2154072434d26e540

In [6]:
# return the dimensions of data
artists.shape, users.shape, artist_user_plays.shape

((292385,), (358868,), (292385, 358868))

In [7]:
artist_user_plays

<Compressed Sparse Row sparse matrix of dtype 'float32'
	with 17535606 stored elements and shape (292385, 358868)>

In [8]:
# return the number of non-missing entries 
artist_user_plays.count_nonzero()

17535605
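
The count above is one less than the 17535606 stored elements reported for the matrix. A likely explanation, assumed here rather than verified against the dataset, is that one stored entry is an explicit zero. The toy matrix below (hypothetical values) sketches how `nnz` and `count_nonzero()` can differ in SciPy:

```python
import numpy as np
from scipy import sparse

# Hypothetical 3x3 CSR matrix with three stored entries, one of them an explicit zero.
data = np.array([3.0, 0.0, 5.0], dtype=np.float32)
indices = np.array([0, 1, 2])      # column index of each stored entry
indptr = np.array([0, 1, 2, 3])    # row i owns data[indptr[i]:indptr[i + 1]]
toy = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))

stored_before = toy.nnz               # stored entries, including the explicit zero
nonzero_before = toy.count_nonzero()  # only entries whose value is not zero

toy.eliminate_zeros()                 # drops explicit zeros in place
stored_after = toy.nnz

print(stored_before, nonzero_before, stored_after)
```

If the mismatch matters downstream, calling `eliminate_zeros()` on the real matrix would make the two counts agree.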

In [9]:
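# investigate the proportion of non-zero entries
# multiply the dimensions as Python ints: np.prod(shape) can overflow where
# NumPy's default integer is 32-bit (e.g. NumPy 1.x on Windows), since
# 292385 * 358868 exceeds 2**31
n_rows, n_cols = artist_user_plays.shape
round(int(artist_user_plays.count_nonzero()) / (n_rows * n_cols), 6)

0.000167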

## Preparing the data
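Okapi BM25 (Best Matching 25) is a ranking function used by search engines to estimate how relevant a document is to a search query, based on how often the query terms occur in the document and on the document's length relative to the rest of the corpus. It was originally designed for ranking text documents against search terms. Here the same weighting is applied to the raw play counts, which dampens the influence of very large counts before the factorization.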

For completeness, the BM25 score of query $Q=\{q_1, \ldots, q_n\}$ for a document $D$ is calculated as:

$$\text{BM25}(D, Q) = \sum_{i=1}^{n} \frac{IDF(q_i) \cdot f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{\text{avgD}})},$$
where
- $IDF(q_i)$ is the inverse document frequency of term $q_i$.
- $f(q_i, D)$ is the term frequency of $q_i$ in the document $D$.
- $k_1$ and $b$ are parameters controlling term saturation and document length normalization.
- $|D|$ is the length of the document $D$.
- $\text{avgD}$ is the average document length in the corpus.
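
As a rough illustration of the formula above, the following sketch scores three toy documents against a two-term query. The function name `bm25_score`, the toy corpus, the defaults `k1=1.5` and `b=0.75`, and the particular IDF variant are all assumptions made for the example. They are not taken from this notebook, which uses `K1=100` and `B=0.8` on play counts.

```python
import math
from collections import Counter

def bm25_score(query, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a tokenized query with BM25."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    term_freq = Counter(doc)
    score = 0.0
    for term in query:
        # Assumed IDF variant: log(1 + (N - n + 0.5) / (n + 0.5)),
        # where n is the number of documents containing the term.
        n_containing = sum(1 for d in corpus if term in d)
        idf = math.log(1.0 + (n_docs - n_containing + 0.5) / (n_containing + 0.5))
        f = term_freq[term]
        # Length normalization: longer-than-average documents are penalized.
        denom = f + k1 * (1.0 - b + b * len(doc) / avg_len)
        score += idf * f * (k1 + 1.0) / denom
    return score

# Hypothetical toy corpus of tokenized documents.
corpus = [
    ["massive", "attack", "trip", "hop"],
    ["trip", "hop", "classics"],
    ["baroque", "orchestra"],
]
query = ["trip", "hop"]
scores = [bm25_score(query, doc, corpus) for doc in corpus]
print(scores)
```

The second document contains the same query terms as the first but is shorter, so it scores higher; the third contains neither term and scores zero.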


In [11]:
from implicit.nearest_neighbours import bm25_weight

# using the weighting function for normalization
artist_user_plays = bm25_weight(artist_user_plays, K1=100, B=0.8)

In [12]:
# transpose to a user-by-artist matrix in CSR format, the layout passed to model.fit below
user_plays = artist_user_plays.T.tocsr()

In [13]:
user_plays.shape

(358868, 292385)

## Training the model with alternating least squares

In [14]:
from implicit.als import AlternatingLeastSquares

# using alternating least squares algorithm
model = AlternatingLeastSquares(factors=16, regularization=0.05, alpha=2.0)

model.fit(user_plays)

100%|██████████| 15/15 [00:20<00:00,  1.40s/it]
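
To make the idea behind alternating least squares concrete, here is a minimal NumPy sketch on a tiny dense matrix with two hidden entries. It is a simplified variant for explicitly observed entries, not the implicit-feedback formulation that `AlternatingLeastSquares` is built around. The function name `als_complete` and all toy values are hypothetical.

```python
import numpy as np

def als_complete(R, mask, factors=2, reg=0.1, iterations=20, seed=0):
    """Fit R ~ U @ V.T on observed entries (mask == True) by alternating ridge regressions."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = R.shape
    U = rng.normal(scale=0.1, size=(n_rows, factors))
    V = rng.normal(scale=0.1, size=(n_cols, factors))
    ridge = reg * np.eye(factors)
    for _ in range(iterations):
        # Fix V and solve a small regularized least-squares problem for each row factor.
        for i in range(n_rows):
            observed = mask[i]
            Vo = V[observed]
            U[i] = np.linalg.solve(Vo.T @ Vo + ridge, Vo.T @ R[i, observed])
        # Fix U and do the same for each column factor.
        for j in range(n_cols):
            observed = mask[:, j]
            Uo = U[observed]
            V[j] = np.linalg.solve(Uo.T @ Uo + ridge, Uo.T @ R[observed, j])
    return U, V

# Toy rank-1 matrix with two entries hidden from the solver.
R = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 0.5, 2.0])
mask = np.ones(R.shape, dtype=bool)
mask[0, 2] = False
mask[3, 0] = False

U, V = als_complete(R, mask, factors=1, reg=1e-3)
R_hat = U @ V.T
observed_rmse = np.sqrt(np.mean((R_hat[mask] - R[mask]) ** 2))
print(observed_rmse, R_hat[0, 2], R_hat[3, 0])
```

Each half-step is an ordinary ridge regression because one factor matrix is held fixed, which is what makes the alternating scheme cheap per iteration. The hidden entries are then read off the product `U @ V.T`, which is the matrix completion step.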


## Similar artists recommendation

In [21]:
# generate similar artist recommendation
artist_id = list(artists).index('beyonce')

ids, scores = model.similar_items(artist_id)

pd.DataFrame({'artists': artists[ids], 'score': scores})

| | artists | score |
|---|---|---|
| 0 | beyonce | 1.0 |
| 1 | ina | 0.98031 |
| 2 | bayje | 0.972187 |
| 3 | ivena | 0.969071 |
| 4 | calvin13 | 0.968697 |
| 5 | cherish ft yung joc | 0.968645 |
| 6 | jordin sparks ft chris brown | 0.968601 |
| 7 | digga | 0.968221 |
| 8 | m. pokora | 0.967144 |
| 9 | matt pokora - newszik.blogspot.com | 0.966513 |
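
One common way to derive item-to-item similarity from a factor model is cosine similarity between item factor vectors. The self-similarity of 1.0 in the output above is consistent with that, although the exact scoring inside `similar_items` is not shown in this notebook. A minimal sketch, with a hypothetical helper `top_k_similar` and made-up factors:

```python
import numpy as np

def top_k_similar(item_factors, item_id, k=3):
    """Return the k items whose factor vectors have the highest cosine similarity to item_id."""
    norms = np.linalg.norm(item_factors, axis=1)
    norms[norms == 0] = 1.0                 # avoid division by zero for all-zero rows
    unit = item_factors / norms[:, None]    # normalize each row to unit length
    scores = unit @ unit[item_id]           # cosine similarity to the query item
    ids = np.argsort(-scores)[:k]           # indices sorted by descending score
    return ids, scores[ids]

# Hypothetical 2-dimensional item factors for four items.
item_factors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])
ids, scores = top_k_similar(item_factors, item_id=0, k=3)
print(ids, scores)
```

The query item always ranks first with a score of 1, just as `beyonce` does above.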


## User-specific recommendation

In [26]:
# generate user-based recommendation
user_id = 10

ids, scores = model.recommend(user_id, user_plays[user_id], N=100, filter_already_liked_items=True)

In [27]:
pd.DataFrame({'artists': artists[ids], 'score': scores})

| | artists | score |
|---|---|---|
| 0 | hello saferide | 0.928602 |
| 1 | anna ternheim | 0.926110 |
| 2 | glasvegas | 0.923304 |
| 3 | miss li | 0.897377 |
| 4 | moneybrother | 0.892571 |
| ... | ... | ... |
| 95 | antony and the johnsons | 0.728593 |
| 96 | my darling you! | 0.728539 |
| 97 | cornelis vreeswijk | 0.727510 |
| 98 | ed harcourt | 0.727034 |
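
Conceptually, a factor model can rank items for a user by the dot product between the user's factor vector and every item's factor vector, then drop items the user has already interacted with. The sketch below illustrates that idea with a hypothetical helper `recommend_top_n` and made-up factors. It is not a description of the internals of `model.recommend`.

```python
import numpy as np

def recommend_top_n(user_vector, item_factors, liked_items, n=2):
    """Rank items by dot-product score, excluding items the user already liked."""
    scores = (item_factors @ user_vector).astype(float)
    scores[list(liked_items)] = -np.inf     # filter already-liked items
    ids = np.argsort(-scores)[:n]           # indices sorted by descending score
    return ids, scores[ids]

# Hypothetical factors: four items and one user in a 2-dimensional latent space.
item_factors = np.array([
    [1.0, 0.0],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.5, 0.5],
])
user_vector = np.array([1.0, 0.2])

ids, scores = recommend_top_n(user_vector, item_factors, liked_items={0}, n=2)
print(ids, scores)
```

Item 0 would have the highest raw score, but it is excluded because the user already liked it, mirroring `filter_already_liked_items=True` in the call above.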
