# Recommendations for the MovieLens dataset

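This tutorial shows how to train a WSKNN model on the MovieLens dataset. We load the ratings from a flat file and transform them into the two mappings that the k-NN model works with: session-items and item-sessions. Each MovieLens user is treated as one session, and each rated movie as one item.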

*MovieLens dataset*

```
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets:
History and Context. ACM Transactions on Interactive Intelligent
Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages.
DOI=http://dx.doi.org/10.1145/2827872
```

*Data schema*

```
ratings.csv

All ratings are contained in the file `ratings.csv`. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:

    userId,movieId,rating,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.

Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
```

In [17]:
import numpy as np
import pandas as pd

from wsknn import fit
from wsknn.preprocessing.static_parsers.pandas_parser import parse_pandas
from wsknn.evaluate import score_model

In [4]:
fpath = 'demo-data/movielens/ml-25m/ratings.csv'
df = pd.read_csv(fpath)
ds = parse_pandas(df, session_id_key='userId', product_key='movieId', time_key='timestamp')

del df
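
To make the two mappings concrete, here is a minimal pure-Python sketch that builds them from a handful of toy rows. The layout used here (`session -> [[items], [timestamps]]` and `item -> [[sessions], [first timestamp of each session]]`) is an assumption about what `parse_pandas` returns, and `build_maps` is a hypothetical helper, not part of `wsknn`.

```python
from collections import defaultdict

# Toy rows in the ratings.csv column order: userId, movieId, rating, timestamp.
rows = [
    (1, 10, 4.0, 100),
    (1, 20, 3.5, 50),
    (2, 10, 5.0, 70),
    (2, 30, 2.0, 90),
]

def build_maps(rows):
    """Group rows into session -> items and item -> sessions mappings.

    Assumed layout (not confirmed against wsknn):
      session_map[user]  = [[items in time order], [their timestamps]]
      item_map[movie]    = [[sessions containing it], [first timestamp of each session]]
    """
    by_session = defaultdict(list)
    for user_id, movie_id, _rating, ts in rows:
        by_session[user_id].append((ts, movie_id))

    session_map = {}
    item_sessions = defaultdict(list)
    for user_id, events in by_session.items():
        events.sort()  # chronological order within the session
        session_map[user_id] = [[m for _, m in events], [t for t, _ in events]]
        first_ts = events[0][0]
        for _, movie_id in events:
            item_sessions[movie_id].append((first_ts, user_id))

    item_map = {}
    for movie_id, pairs in item_sessions.items():
        pairs.sort()  # order sessions by their first timestamp
        item_map[movie_id] = [[u for _, u in pairs], [t for t, _ in pairs]]
    return session_map, item_map

session_map, item_map = build_maps(rows)
print(session_map)
print(item_map)
```

Movie `10` appears in both toy sessions, so its entry in `item_map` lists both users; that shared entry is what lets a k-NN model find sessions that overlap with the one being scored.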

In [5]:
model = fit(sessions=ds['session-map'],
            items=ds['item-map'],
            number_of_recommendations=5,
            number_of_neighbors=10,
            sampling_strategy='recent',
            sample_size=50,
            weighting_func='log',
            ranking_strategy='log',
            return_events_from_session=False,
            recommend_any=False)
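
For intuition about what `number_of_neighbors` and `number_of_recommendations` control, the following is a deliberately simplified sketch of session-based k-NN recommendation. It is not the WSKNN implementation: it ignores sampling, timestamps and the weighting and ranking functions, and it uses flat toy mappings (`session -> items`, `item -> sessions`). The function name and the similarity measure are assumptions chosen for illustration.

```python
import math
from collections import Counter

def recommend(session_items, session_map, item_map, n_neighbors=2, n_recs=2):
    """Simplified session-based k-NN; a sketch, not wsknn's algorithm."""
    current = set(session_items)
    # Candidate neighbors: every stored session sharing at least one item.
    candidates = {s for item in current for s in item_map.get(item, [])}
    # Similarity: item overlap normalised by the geometric mean of session sizes.
    sims = {}
    for s in candidates:
        other = set(session_map[s])
        sims[s] = len(current & other) / math.sqrt(len(current) * len(other))
    # Keep the most similar sessions (ties broken by session key).
    neighbors = sorted(sims, key=lambda s: (-sims[s], s))[:n_neighbors]
    # Score each unseen item by the summed similarity of neighbors containing it.
    scores = Counter()
    for s in neighbors:
        for item in session_map[s]:
            if item not in current:
                scores[item] += sims[s]
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n_recs]

toy_sessions = {'a': [1, 2, 3], 'b': [2, 3, 4], 'c': [5, 6]}
toy_items = {1: ['a'], 2: ['a', 'b'], 3: ['a', 'b'], 4: ['b'], 5: ['c'], 6: ['c']}

recs = recommend([2, 3], toy_sessions, toy_items)
print(recs)
```

Session `c` shares nothing with the input, so it never becomes a neighbor and its items are never recommended; items `1` and `4` come from the two overlapping sessions.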

In [6]:
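def get_sample_sessions(set_of_sessions, n_sessions=100):
    """Return n_sessions distinct sessions drawn at random from the session map."""
    sessions_keys = list(set_of_sessions.keys())
    # replace=False keeps the same session from being drawn twice
    key_sample = np.random.choice(sessions_keys, n_sessions, replace=False)
    sampled = [set_of_sessions[dk] for dk in key_sample]
    return sampled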

In [14]:
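import csv

def get_movie_name(movie_id: int):
    """Return the title of the movie with the given ID, or None if it is not found."""
    # csv.reader keeps quoted titles that contain commas in one piece
    with open('demo-data/movielens/ml-25m/movies.csv', 'r', newline='', encoding='utf-8') as fin:
        reader = csv.reader(fin)
        next(reader)  # skip the header row
        for row in reader:
            if movie_id == int(row[0]):
                return row[1]
    return None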

In [8]:
test_sessions = get_sample_sessions(set_of_sessions=ds['session-map'], n_sessions=5)

In [15]:
for ts in test_sessions:
    print('User watched')
    print(str([get_movie_name(x) for x in ts[0]]))
    print('Recommendations')
    recs = model.recommend(ts)
    for rec in recs:
        print('Item:', get_movie_name(rec[0]), '| weight:', rec[1])
    print('---')
    print('')
    

User watched
['Austin Powers: International Man of Mystery (1997)', '"Blues Brothers', 'Shrek (2001)', 'Happy Gilmore (1996)', 'Annie Hall (1977)', 'Butch Cassidy and the Sundance Kid (1969)', "Monty Python's Life of Brian (1979)", 'Psycho (1960)', 'Airplane! (1980)', 'Citizen Kane (1941)', 'North by Northwest (1959)', 'Ransom (1996)', 'Almost Famous (2000)', 'French Kiss (1995)', 'M*A*S*H (a.k.a. MASH) (1970)', 'Sling Blade (1996)', '"Honey', '"Crouching Tiger', "Ferris Bueller's Day Off (1986)", '"Princess Bride', 'Rain Man (1988)', '"Breakfast Club', '"Sound of Music', '"Maltese Falcon', 'Lost in Translation (2003)', '"Usual Suspects', 'My Big Fat Greek Wedding (2002)', 'Pirates of the Caribbean: The Curse of the Black Pearl (2003)', 'Big Fish (2003)', 'Hoosiers (a.k.a. Best Shot) (1986)', '"River Runs Through It', 'Magnolia (1999)', '"Straight Story', 'Children of a Lesser God (1986)', 'In the Heat of the Night (1967)', 'Waiting for Guffman (1996)', "My Best Friend's Wedding (1997)
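
Titles in the output above that begin with a stray quote, such as `"Blues Brothers`, are what a plain `str.split(',')` yields for a quoted CSV field that itself contains a comma. The sketch below contrasts that with the standard-library `csv` module on one illustrative row (the ID and title are made up), and also shows a hypothetical `load_titles` helper that reads the file once instead of rescanning it for every lookup.

```python
import csv
import io

# Illustrative row in the movies.csv layout (movieId,title,genres); the ID and
# title are made up for this example.
line = '7,"Sample Movie, The (1999)",Comedy\n'

naive_title = line.split(',')[1]            # cuts the title at its inner comma
parsed_title = next(csv.reader([line]))[1]  # respects the surrounding quotes

def load_titles(fileobj):
    """Read a movies.csv-style file once and return {movieId: title}."""
    reader = csv.reader(fileobj)
    next(reader)  # skip the header row
    return {int(row[0]): row[1] for row in reader}

titles = load_titles(io.StringIO('movieId,title,genres\n' + line))
print(repr(naive_title))
print(repr(parsed_title))
print(titles[7])
```

With a dictionary like `titles`, each lookup is a single key access, which matters when many sessions and recommendations are printed.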

In [16]:
# Score system

In [21]:
test_sessions = get_sample_sessions(set_of_sessions=ds['session-map'], n_sessions=500)
scores = score_model(test_sessions, model, k=5, skip_short_sessions=True, sliding_window=False)

In [22]:
print(scores)

{'MRR': 0.4972666666666667, 'Precision': 0.324, 'Recall': 0.020594599441273644}
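
To help read these numbers, here is a minimal sketch of one common way to compute MRR, precision and recall at `k` for a single session. These are textbook definitions under stated assumptions; `score_model` may differ in details such as how it splits a session into input and held-out items, and `score_at_k` is a hypothetical helper.

```python
def score_at_k(recommended, relevant, k):
    """Score one ranked recommendation list against a set of held-out items."""
    top_k = recommended[:k]
    hits = [item for item in top_k if item in relevant]
    precision = len(hits) / k             # share of the k slots that were hits
    recall = len(hits) / len(relevant)    # share of held-out items recovered
    mrr = 0.0
    for rank, item in enumerate(top_k, start=1):
        if item in relevant:
            mrr = 1.0 / rank              # reciprocal rank of the first hit
            break
    return {'MRR': mrr, 'Precision': precision, 'Recall': recall}

result = score_at_k(recommended=[10, 20, 30, 40, 50],
                    relevant={20, 50, 60, 70},
                    k=5)
print(result)
```

Under these definitions recall can never exceed `k / len(relevant)`, so with `k=5` and users who have rated many movies a low recall next to a much higher precision is plausible.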
