# Matrix completion via recommendation system example

This example applies matrix completion to a music recommendation system. The data come from the [360K Last.fm dataset](http://ocelma.net/MusicRecommendationDataset/lastfm-360K.html), which records how many times each user played each artist.

In [1]:
pip install -U implicit h5py

In [2]:
# retrieving last.fm dataset
from implicit.datasets.lastfm import get_lastfm
import numpy as np
import pandas as pd
from scipy import sparse
import os
from pathlib import Path

## Downloading and saving the Last.fm dataset

In [3]:
filepath = r'datasets/'
Path(filepath).mkdir(exist_ok=True)

if not os.path.exists(filepath + r'artist_user_plays.npz'):
    # save our dataset in sparse format
    artists, users, artist_user_plays = get_lastfm()

    sparse.save_npz(filepath + r'artist_user_plays.npz', artist_user_plays)
    np.save(filepath + 'artists.npy', artists)
    np.save(filepath + 'users.npy', users)
else:
    # load our dataset into original format
    artist_user_plays = sparse.load_npz(filepath + r'artist_user_plays.npz')
    artists = np.load(filepath + 'artists.npy', allow_pickle=True)
    users = np.load(filepath + 'users.npy', allow_pickle=True)

In [10]:
# investigate the content of the downloaded dataset

artists[0:50] # look at first 50
artists[np.random.randint(size=50, low=0, high=len(artists))] # look at 50 random artists
users[np.random.randint(size=50, low=0, high=len(users))] # look at 50 random users

array(['0f81d438a2f78c9e457b0f317ab324b4cafffb56',
       '8d4b2f9843a2d3b197fdf8f00e08c77beb9b8798',
       'af56ef9490069411c1bf6b6d228ead18f29677a4',
       '838da7618ef395edd47cb280fd52c182c03cbad5',
       '209bf643dd62b16d09b306b67ac644eef2e0de4c',
       '2afd21289b7491ff38d2087d9e66eef6f04109dd',
       '7a3219ecce71ea9783cc69101f34e6a7dfbcca34',
       'e45d9187e2269572e50e04c50826e3f5b9fe8f56',
       'ab4948a41a2ce48d3ec39d3a85615553f3c8d577',
       'fa5244dc9bc9ab9529396ad7c345f6df09cc7ed8',
       '7d90969e3971bd3e19bb1fba45af4148d80fdb11',
       'aff79e4b6cacb39f2d396d60958e28028fb8d5fa',
       '7302d5bd308370df9e94951e40a7c359a2a3c29d',
       'a91fd07f13ec1b560306050b622c47a2f5bc4006',
       '99c5b4cea96762277736d1897da2635b219c6c32',
       '6fa600dd43076ede5a88bd4ee6c5ca531624cd9a',
       '2993c0625c81b812c31aff462212d5cb1a30b0ae',
       '9b34451755b82ee18fa55eda00fe76892a46694e',
       'fb78735fdc4fa5b5ad82cc2b09222959d3b20323',
       'e2cb3dd7e52686af2729875

In [11]:
# return the dimensions of data

artist_user_plays.shape, users.shape, artists.shape

((292385, 358868), (358868,), (292385,))

In [16]:
# return the number of non-missing entries 
artist_user_plays.count_nonzero()

17535605

In [19]:
# investigate the proportion of non-zero entries

artist_user_plays.count_nonzero() / np.prod(artist_user_plays.shape)

# 17535605 / (292385 * 358868)

0.0001671209636692248

## Preparing the data
Okapi BM25 (BM stands for "best matching") is a ranking function used by search engines to estimate how relevant each document is to a given search query, based on how often the query terms occur in the document and on the document's length relative to the rest of the corpus. The algorithm was originally developed for ranking documents against search terms; here it is used to reweight the play counts before fitting the model.

For completeness, the BM25 score of query $Q=\{q_1, \ldots, q_n\}$ for a document $D$ is calculated as:

$$\text{BM25}(D, Q) = \sum_{i=1}^{n} \frac{IDF(q_i) \cdot f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{\text{avgD}})},$$
where
- $IDF(q_i)$ is the inverse document frequency of term $q_i$.
- $f(q_i, D)$ is the term frequency of $q_i$ in the document $D$.
- $k_1$ and $b$ are parameters controlling term saturation and document length normalization.
- $|D|$ is the length of the document $D$.
- $\text{avgD}$ is the average document length in the corpus.
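
To make the formula concrete, here is a minimal sketch that scores a few toy "documents" against a query in plain Python. The helper name `bm25_score`, the toy corpus, the values `k1=1.5` and `b=0.75`, and the smoothed IDF variant are all assumptions chosen for illustration; the formula above does not fix a particular IDF definition, and this is not the implementation used by `implicit`.

```python
import math
from collections import Counter

def bm25_score(query, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a tokenized query (illustrative helper)."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs  # avgD
    tf = Counter(doc)                               # f(q_i, D) for every term in D
    score = 0.0
    for term in query:
        n_containing = sum(1 for d in corpus if term in d)
        # Assumed IDF variant: smoothed so that it is never negative
        idf = math.log((n_docs - n_containing + 0.5) / (n_containing + 0.5) + 1.0)
        f = tf[term]
        denom = f + k1 * (1.0 - b + b * len(doc) / avg_len)
        score += idf * f * (k1 + 1.0) / denom
    return score

# Hypothetical toy corpus: each "document" is a list of tokens
corpus = [
    ["the", "smiths", "indie", "rock"],
    ["david", "bowie", "glam", "rock"],
    ["npr", "news", "podcast"],
]
scores = [bm25_score(["indie", "rock"], d, corpus) for d in corpus]
```

The first document matches both query terms, the second matches only the more common term, and the third matches neither, so the scores decrease in that order.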


In [20]:
from implicit.nearest_neighbours import bm25_weight

# using the weighting function for normalization

# K1 controls how quickly large play counts saturate; B controls the strength of the length normalization
artist_user_plays = bm25_weight(artist_user_plays, K1=100, B=0.8)

In [21]:
# transpose to a (users x artists) matrix in CSR format for model fitting
user_plays = artist_user_plays.T.tocsr()

In [22]:
user_plays.shape

(358868, 292385)

## Training the model with alternating least squares

In [23]:
from implicit.als import AlternatingLeastSquares

# using alternating least squares algorithm

model = AlternatingLeastSquares(
    factors=16,           # number of latent factors per user and per artist
    regularization=0.05,  # L2 penalty on the factor matrices
    alpha=2.0,            # weight given to the observed (positive) entries
)

model.fit(user_plays)

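
For intuition about what alternating least squares does, here is a minimal NumPy sketch of matrix completion on a tiny dense matrix. It is a simplified explicit-feedback variant that fits only the observed entries, not the implicit-feedback algorithm that `AlternatingLeastSquares` implements; the function name `als_complete`, the toy matrix, and all parameter values are assumptions made for illustration.

```python
import numpy as np

def als_complete(R, mask, factors=1, reg=1e-3, iterations=20, seed=0):
    """Fit R ~ U @ V.T on the observed entries (mask == True) by alternating ridge regressions."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = R.shape
    # Positive random initialization (an arbitrary choice for this positive toy data)
    U = rng.uniform(0.1, 1.0, size=(n_rows, factors))
    V = rng.uniform(0.1, 1.0, size=(n_cols, factors))
    ridge = reg * np.eye(factors)
    for _ in range(iterations):
        # Hold V fixed: each row of U is a small ridge regression on that row's observed entries
        for i in range(n_rows):
            obs = mask[i]
            Vo = V[obs]
            U[i] = np.linalg.solve(Vo.T @ Vo + ridge, Vo.T @ R[i, obs])
        # Hold U fixed: same update for each row of V, using that column's observed entries
        for j in range(n_cols):
            obs = mask[:, j]
            Uo = U[obs]
            V[j] = np.linalg.solve(Uo.T @ Uo + ridge, Uo.T @ R[obs, j])
    return U, V

# Rank-1 toy matrix with two entries hidden from the solver
truth = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 0.5, 2.0])
mask = np.ones_like(truth, dtype=bool)
mask[0, 2] = False
mask[3, 0] = False

U, V = als_complete(truth, mask)
completed = U @ V.T  # includes predictions for the two hidden entries
```

Because the toy matrix is exactly rank 1, the two hidden entries (true values 2.0 and 4.0) are recovered almost exactly. Real play-count data are far noisier, which is why the model above uses more factors and stronger regularization.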

## Similar artists recommendation

In [37]:
# generate similar-artist recommendations
# an id can be looked up by name, e.g. list(artists).index('the smiths')

# query artist: uncomment one of the alternatives to change it
# artist_id = 252512 # beatles
# artist_id = 84228 # david bowie
artist_id = 262845 # the smiths

ids, scores = model.similar_items(artist_id)

pd.DataFrame({'ids': ids, 'artists': artists[ids], 'score': scores})

|   | ids | artists | score |
|---|-----|---------|-------|
| 0 | 262845 | the smiths | 1.0 |
| 1 | 192419 | morrissey | 0.996911 |
| 2 | 212909 | pixies | 0.992118 |
| 3 | 154231 | joy division | 0.989172 |
| 4 | 281709 | yeah yeah yeahs | 0.986983 |
| 5 | 84228 | david bowie | 0.986954 |
| 6 | 254550 | the cure | 0.986907 |
| 7 | 264338 | the velvet underground | 0.986645 |
| 8 | 276889 | violent femmes | 0.986512 |
| 9 | 261517 | the raveonettes | 0.986391 |
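
The query artist scores exactly 1.0 against itself, which is what a cosine similarity between latent factor vectors would produce. The sketch below assumes that this is how the scores are computed; the helper `cosine_similar_items` and the toy factor matrix are hypothetical and stand in for the fitted model's artist factors.

```python
import numpy as np

def cosine_similar_items(item_factors, item_id, n=3):
    """Return the n items whose factor vectors have the highest cosine similarity to item_id."""
    norms = np.linalg.norm(item_factors, axis=1)
    scores = item_factors @ item_factors[item_id] / (norms * norms[item_id])
    top = np.argsort(-scores)[:n]  # indices sorted by descending similarity
    return top, scores[top]

# Hypothetical 2-factor embeddings for four items
item_factors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])
ids, scores = cosine_similar_items(item_factors, item_id=0)
```

As in the table above, the query item comes back first with a score of 1, followed by the items whose factor vectors point in the most similar direction.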


## User-specific recommendation

In [51]:
# generate user-based recommendation

user_id = 99996

ids, scores = model.recommend(user_id, user_plays[user_id], N = 10, filter_already_liked_items=True) # filter out items the user has already liked

pd.DataFrame({'ids': ids, 'artists': artists[ids], 'score': scores})

|   | ids | artists | score |
|---|-----|---------|-------|
| 0 | 201618 | npr | 1.109099 |
| 1 | 260451 | the onion | 1.08759 |
| 2 | 259805 | the mountain goats | 1.086237 |
| 3 | 197911 | neutral milk hotel | 1.074767 |
| 4 | 190206 | mitch hedberg | 1.068583 |
| 5 | 250181 | ted leo and the pharmacists | 1.048994 |
| 6 | 142552 | iron & wine | 1.04578 |
| 7 | 149440 | jim gaffigan | 1.029315 |
| 8 | 272741 | ugly casanova | 1.025461 |
| 9 | 85081 | david sedaris | 1.024138 |
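
In a matrix factorization model, a user's score for an artist is typically the dot product of the user's factor vector with the artist's factor vector, and already-played artists are excluded before ranking. The sketch below illustrates that idea under those assumptions; `recommend_for_user`, the toy factors, and the `liked` set are hypothetical and do not reproduce the library's internals.

```python
import numpy as np

def recommend_for_user(user_vec, item_factors, liked, n=2):
    """Rank items by dot-product score, skipping items the user already liked."""
    scores = item_factors @ user_vec
    scores[list(liked)] = -np.inf  # filter out already-liked items
    top = np.argsort(-scores)[:n]
    return top, scores[top]

# Hypothetical 2-factor embeddings for four items and one user
item_factors = np.array([
    [1.0, 0.0],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.5, 0.5],
])
user_vec = np.array([1.0, 0.2])
ids, scores = recommend_for_user(user_vec, item_factors, liked={0})
```

Item 0 would have had the highest raw score, but it is filtered out because the user already liked it, so the next-best items are returned instead.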
