# Movie recommendations using ALS

This is the main notebook with training ALS (Alternating Least Squares) model on `ua.base`, as well as evaluation on test set from `ua.test` (however, other data sets can be specified). Then, there are some examples of recommendations for the user and for similar movie. 

This notebook was ran in Kaggle (as the gpu was needed).

In [1]:
!git clone https://github.com/leiluk1/movie-recommender-system.git
%cd /kaggle/working/movie-recommender-system/

Cloning into 'movie-recommender-system'...
remote: Enumerating objects: 14, done.[K
remote: Counting objects: 100% (14/14), done.[K
remote: Compressing objects: 100% (8/8), done.[K
remote: Total 14 (delta 0), reused 11 (delta 0), pack-reused 0[K
Receiving objects: 100% (14/14), 4.73 MiB | 20.25 MiB/s, done.
/kaggle/working/movie-recommender-system


In [2]:
!pip install -q implicit

In [3]:
import numpy as np
import pandas as pd
from tqdm import tqdm
from implicit.gpu.als import AlternatingLeastSquares
import scipy.sparse as sp
from benchmark.evaluate import evaluate



In [4]:
import zipfile

with zipfile.ZipFile('./data/raw/ml-100k.zip', 'r') as zip_ref:
        zip_ref.extractall('./data/raw/')

## Step 1. Data Preprocessing

In [5]:
# some constants (were found in 1.0 notebook)
num_users = 943
num_movies = 1682

# create sparse matrix for given dataset 
def create_csr_matrix(data_set='ua', mode='base'):
    main_dict = {}
    df = pd.read_csv(f'./data/raw/ml-100k/{data_set}.{mode}',
                          delimiter='\t',
                          names=['user_id', 'movie_id', 'rating', 'timestamp'])
    df.sort_values(by=['user_id'], inplace=True)
    for _, row in df.iterrows():
        user_id, movie_id, rating = row['user_id'], row['movie_id'], row['rating']

        if user_id not in main_dict:
            main_dict[user_id] = [(movie_id, rating)]
        else:
            main_dict[user_id] += [(movie_id, rating)]
    
    user_matrix = np.zeros((num_users, num_movies), dtype=np.float64)
    
    for i, val in enumerate(main_dict.values()):
        for movie_id, rating in val:
            user_matrix[i, (movie_id - 1)] = rating  # as movies ids are from 1
    
    sparse_matrix = sp.csr_matrix(user_matrix, dtype=np.float64)
    return sparse_matrix

In [7]:
# create sparse matrices for train and test datasets from ua.base (train) and ua.test files
train_users_sparse = create_csr_matrix(data_set='ua', mode='base')
test_users_sparse = create_csr_matrix( data_set='ua', mode='test')

In [10]:
train_users_sparse

<943x1682 sparse matrix of type '<class 'numpy.float64'>'
	with 90570 stored elements in Compressed Sparse Row format>

## Step 2. Training the model

In [13]:
import threadpoolctl
threadpoolctl.threadpool_limits(1, "blas")


# search for best model 
factors = [30, 50, 70, 100, 200]
regularizations = [0.001, 0.01, 0.1, 1.0]
iterations = [25, 50, 100, 200]
best_map10 = 0.0
best_ndcg10 = 0.0

for factor in factors:
    for regularization in regularizations:
        for iteration in iterations:
            print(f'Factors: {factor}, regularization: {regularization}, iterations: {iteration}')
            model = AlternatingLeastSquares(factors=factor, 
                                            regularization=regularization, 
                                            iterations=iteration,
                                            alpha=5,
                                            random_state=22) # set random seed for reproducibility
            
            model.fit(train_users_sparse, show_progress=False)
            
            # evaluate model on test dataset using method from benchmark evaluate.py
            p10, map10, ndcg10 = evaluate(model, train_users_sparse, test_users_sparse)
            print('------------------------------------------------')
            
            if map10 > best_map10 and ndcg10 > best_ndcg10:
                best_p10 = p10
                best_map10 = map10
                best_ndcg10 = ndcg10
                best_params = {'factors': factor, 'regularization': regularization, 'iterations': iteration}
                
                # save best model
                model.save('/kaggle/working/movie-recommender-system/models/best')
                

Factors: 30, regularization: 0.001, iterations: 25
Precision@10=0.2502, MAP@10=0.1509, NDCG@10=0.2863
------------------------------------------------
Factors: 30, regularization: 0.001, iterations: 50
Precision@10=0.2525, MAP@10=0.1516, NDCG@10=0.2879
------------------------------------------------
Factors: 30, regularization: 0.001, iterations: 100
Precision@10=0.2531, MAP@10=0.1519, NDCG@10=0.2880
------------------------------------------------
Factors: 30, regularization: 0.001, iterations: 200
Precision@10=0.2528, MAP@10=0.1516, NDCG@10=0.2878
------------------------------------------------
Factors: 30, regularization: 0.01, iterations: 25
Precision@10=0.2512, MAP@10=0.1512, NDCG@10=0.2869
------------------------------------------------
Factors: 30, regularization: 0.01, iterations: 50
Precision@10=0.2530, MAP@10=0.1521, NDCG@10=0.2882
------------------------------------------------
Factors: 30, regularization: 0.01, iterations: 100
Precision@10=0.2525, MAP@10=0.1518, NDCG@10

In [14]:
# params of best model
best_params

{'factors': 30, 'regularization': 1.0, 'iterations': 200}

## Step 3. Evaluation results

In [15]:
print('Evaluation results:')
print(f'Test Precision@10: {best_p10:.4f}')
print(f'Test MAP@10: {best_map10:.4f}')
print(f'Test NDCG@10: {best_ndcg10:.4f}')

Evaluation results:
Test Precision@10: 0.2549
Test MAP@10: 0.1531
Test NDCG@10: 0.2899


## Step 3. Inference or recommendation examples

In [16]:
model = AlternatingLeastSquares.load('./models/best')

In [17]:
ratings_cols = ['user_id', 'movie_id', 'rating', 'timestamp']

ratings_ds = pd.read_csv('./data/raw/ml-100k/u.data', 
                         delimiter='\t', 
                         names=ratings_cols)

In [18]:
movies_cols = ['movie_id', 'movie_title', 'release_date', 'video_release_date', 'imdb_url', 
               'unknown', 'action', 'adventure', 'animation', 'childrens', 'comedy', 'crime', 
               'documentary', 'drama', 'fantasy', 'film-noir', 'horror', 'musical', 'mystery', 'romance', 'sci-fi', 'thriller', 'war', 'western']

movies_ds = pd.read_csv('./data/raw/ml-100k/u.item', 
                         delimiter='|', 
                         encoding='latin-1',
                         names=movies_cols)

movies_id_title = movies_ds[['movie_id', 'movie_title']]

In [19]:
merged_ds = ratings_ds.merge(movies_id_title, on='movie_id')
merged_ds.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,movie_title
0,196,242,3,881250949,Kolya (1996)
1,63,242,3,875747190,Kolya (1996)
2,226,242,5,883888671,Kolya (1996)
3,154,242,3,879138235,Kolya (1996)
4,306,242,5,876503793,Kolya (1996)


In [20]:
merged_ds.shape[0] == ratings_ds.shape[0]

True

### Movies recommendations for a given user

In [21]:
user_id = 95
fav_10_movies_titles = merged_ds.query(f'user_id == {user_id}') \
                        .sort_values(by='rating')[-10:]['movie_title'] \
                        .values.tolist()

print(f'Favourite 10 films of user {user_id}:\n')
for movie in fav_10_movies_titles:
    print(movie)

Favourite 10 films of user 95:

Raiders of the Lost Ark (1981)
Bridge on the River Kwai, The (1957)
Much Ado About Nothing (1993)
Blues Brothers, The (1980)
Cyrano de Bergerac (1990)
Lion King, The (1994)
American Werewolf in London, An (1981)
Manchurian Candidate, The (1962)
English Patient, The (1996)
My Fair Lady (1964)


In [22]:
users_sparse = create_csr_matrix(data_set='u', mode='data')

ids, _ = model.recommend((user_id - 1), users_sparse[user_id - 1], N=10, 
                              filter_already_liked_items=True)

recommended_10_movies = merged_ds[merged_ds['movie_id'].isin(ids)]['movie_title'] \
                        .unique().tolist()

print(f'10 movies recommendations for user {user_id}: \n')
for movie in recommended_10_movies:
    print(movie)

10 movies recommendations for user 95: 

Star Trek III: The Search for Spock (1984)
Tales From the Crypt Presents: Demon Knight (1995)
Vertigo (1958)
Unbearable Lightness of Being, The (1988)
Corrina, Corrina (1994)
Parent Trap, The (1961)
Snow White and the Seven Dwarfs (1937)
Ref, The (1994)
Breaking the Waves (1996)
Lone Star (1996)


### Similar movies suggestions based on a given movie (The Lion King)

In [24]:
film_id = movies_ds[movies_ds['movie_title'] == 'Lion King, The (1994)']['movie_id']
film_ids, _ = model.similar_items((film_id - 1), N = 10)

similar_10_films = merged_ds[merged_ds['movie_id'].isin(film_ids[0])]['movie_title'] \
                        .unique().tolist()

print('10 similar films to The Lion King:')
similar_10_films

10 similar films to The Lion King:


['Silence of the Lambs, The (1991)',
 'Crow, The (1994)',
 'Twelve Monkeys (1995)',
 'Hour of the Pig, The (1993)',
 'Home Alone (1990)',
 'Bedknobs and Broomsticks (1971)',
 'Four Weddings and a Funeral (1994)',
 'Parent Trap, The (1961)',
 'Bad Boys (1995)',
 'Fly Away Home (1996)']

At least, we can assess the suggested films that are similar to 'The Lion King'. As can be concluded, films such as 'Home Alone', 'Bedknobs and Broomsticks', 'The Parent Trap' and 'Fly Away Home' would be suitable recommendations to see after watching 'The Lion King'. 