# Objective

**MUST** Complete this notebook using the full MovieLens dataset, and compare your results to online benchmarks. See if you can improve your accuracy. Look on the book’s website and the fast.ai forums for ideas. Note that there are more columns in the full dataset—see if you can use those too (the next chapter might give you ideas). 

I used the 1M dataset instead of the 20M one because training a single epoch on the 20M dataset takes one hour.

In [1]:
import fastbook
fastbook.setup_book()

In [3]:
from fastbook import *
from fastai.collab import *
from fastai.tabular.all import *

In [39]:
def rmse(r_pred, r_true):
    return ((r_pred - r_true)**2).mean()**0.5
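
A note on how this metric is aggregated: as far as I understand, fastai averages a plain function passed to `metrics` over the validation batches, so the reported number is a mean of per-batch RMSEs rather than the RMSE over the whole validation set. That would explain why the `rmse` column in the training log further down is slightly below the square root of `valid_loss` (for epoch 0, the square root of 0.839743 is about 0.9164, while the logged value is 0.915919). `fastai.metrics` also provides its own `rmse` metric, which the definition above shadows. The sketch below uses synthetic data (an assumption, not the MovieLens ratings) and only the standard library to show the effect:

```python
import math
import random

def rmse(preds, targets):
    """Root mean squared error over two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds))

# Synthetic ratings and noisy predictions (illustrative only).
random.seed(0)
targets = [random.randint(1, 5) for _ in range(2048)]
preds = [t + random.gauss(0, 0.9) for t in targets]

# Average of per-batch RMSEs, using the same batch size as the notebook.
bs = 512
batch_rmses = [rmse(preds[i:i + bs], targets[i:i + bs])
               for i in range(0, len(preds), bs)]
mean_of_batch_rmses = sum(batch_rmses) / len(batch_rmses)

# RMSE computed once over every sample.
global_rmse = rmse(preds, targets)

print(f"mean of per-batch RMSEs: {mean_of_batch_rmses:.6f}")
print(f"RMSE over all samples:   {global_rmse:.6f}")
```

With equal-sized batches the batch-averaged value can never exceed the global one, because the square root is concave. The gap is small here, but it matters when comparing against benchmark numbers quoted to three decimals.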

## 1M dataset

In [58]:
# Download the MovieLens 1M dataset
path = untar_data("http://files.grouplens.org/datasets/movielens/ml-1m.zip")

In [63]:
ratings = pd.read_csv(path/'ratings.dat', delimiter='::', header=None,
                      names=('UserID','MovieID','Rating','Timestamp'), engine='python')
ratings.head()

Unnamed: 0,UserID,MovieID,Rating,Timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [64]:
movies = pd.read_csv(path/'movies.dat', delimiter='::', header=None,
                     usecols=(0,1), names=('MovieID','Title'), engine='python')
movies.head()

Unnamed: 0,MovieID,Title
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)


In [65]:
ratings = ratings.merge(movies)
ratings.head()

Unnamed: 0,UserID,MovieID,Rating,Timestamp,Title
0,1,1193,5,978300760,One Flew Over the Cuckoo's Nest (1975)
1,2,1193,5,978298413,One Flew Over the Cuckoo's Nest (1975)
2,12,1193,4,978220179,One Flew Over the Cuckoo's Nest (1975)
3,15,1193,4,978199279,One Flew Over the Cuckoo's Nest (1975)
4,17,1193,5,978158471,One Flew Over the Cuckoo's Nest (1975)


In [79]:
dls = CollabDataLoaders.from_df(ratings, item_name='Title', bs=512)
dls.show_batch()

Unnamed: 0,UserID,Title,Rating
0,2663,"Few Good Men, A (1992)",4
1,5880,Dirty Dancing (1987),3
2,198,12 Angry Men (1957),5
3,5319,Robocop 2 (1990),4
4,909,"Sound of Music, The (1965)",2
5,5746,Alien: Resurrection (1997),3
6,6035,Lost in Space (1998),1
7,517,Raiders of the Lost Ark (1981),3
8,2866,Sleepers (1996),3
9,3352,Bringing Out the Dead (1999),5


In [86]:
embs = get_emb_sz(dls)
embs

[(6041, 210), (3707, 160)]
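
These sizes come from fastai's embedding-size heuristic, which to my knowledge is `min(600, round(1.6 * n_cat**0.56))`. The cardinalities 6041 and 3707 are taken from the output above; they presumably include one extra `#na#` category on top of the raw user and title counts. A minimal sketch that re-derives the two sizes without fastai:

```python
def emb_sz_rule(n_cat: int) -> int:
    """Heuristic embedding width for a categorical variable with n_cat levels."""
    return min(600, round(1.6 * n_cat ** 0.56))

# Cardinalities copied from the get_emb_sz output above.
cardinalities = {"UserID": 6041, "Title": 3707}

emb_szs = [(n, emb_sz_rule(n)) for n in cardinalities.values()]
print(emb_szs)
```

This should reproduce the `(cardinality, width)` pairs printed by `get_emb_sz(dls)`.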

With `use_nn=True`, the default setting for layers is `layers=[50]`, i.e. a single hidden layer of 50 units.

In [128]:
learn = collab_learner(dls, use_nn=True, y_range=(0, 5.5), metrics=rmse)
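
On `y_range=(0, 5.5)`: as I understand it, fastai squashes the network's raw output into this range with a scaled sigmoid. Since a sigmoid never quite reaches its upper bound, setting the top slightly above 5 lets the model predict a full 5-star rating. A minimal standard-library sketch of that scaling (the function mirrors fastai's `sigmoid_range`, but is re-implemented here for illustration):

```python
import math

def sigmoid_range(x: float, low: float, high: float) -> float:
    """Map a raw activation x into the open interval (low, high)."""
    return 1.0 / (1.0 + math.exp(-x)) * (high - low) + low

# Raw activations from very negative to very positive.
for raw in (-10.0, 0.0, 2.0, 10.0):
    print(f"raw={raw:6.1f} -> rating={sigmoid_range(raw, 0, 5.5):.3f}")
```

A raw output of 0 lands at the midpoint of the range, and large activations approach but never reach 5.5.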

In [129]:
learn.fit_one_cycle(15, 1e-3, wd=0.8)

epoch,train_loss,valid_loss,rmse,time
0,0.841621,0.839743,0.915919,00:12
1,0.823161,0.820149,0.905176,00:12
2,0.808445,0.806419,0.897533,00:12
3,0.793406,0.797748,0.892674,00:12
4,0.789112,0.788893,0.88771,00:12
5,0.786067,0.786942,0.886643,00:12
6,0.778984,0.781575,0.88358,00:12
7,0.778827,0.770108,0.87709,00:12
8,0.755187,0.765821,0.874674,00:12
9,0.739292,0.753396,0.867521,00:12


So my best RMSE is `0.853697`, which would place between 7th and 8th (rank 7.5) on the benchmark below.

# Benchmark
Comparison from [paperswithcode](https://paperswithcode.com/sota/collaborative-filtering-on-movielens-1m):
- Rank 1: **GRAEM**, RMSE (root mean squared error) = 0.818
- Rank 10: **Factorized EAE**, RMSE = 0.860

## My Improvements
- Batch size of 512, which made training 6x faster than `bs=64`
- Automatic embedding sizes, enabled by `use_nn=True`
- Higher weight decay (`wd=0.8`)