# Napkin Notebook: How long should RecSys experiments take based on our results so far?

## Background
We've run quite a few experiments on MovieLens 1M. This dataset is described below in a dictionary of relevant descriptive stats.

In [1]:
movielens_1m = {
    'num_users': 6040,
    'num_movies': 3900,
    'num_ratings': 1000209,
}

On average, our MovieLens 1M experiments were relatively quick.

SVD = about 100 + 150 seconds per fold (roughly 250 seconds in total).
KNN = anywhere from 130 seconds (heavy boycott) to 500 seconds (light boycott) per fold.

For now, let's focus on SVD, assuming KNN will be very difficult to scale to big datasets without major modifications (especially since its cost can't be "pre-paid": the expensive computations have to run when producing predictions).

We'll assume an average of about 250 seconds per fold, or 1250 seconds per experiment (5 folds).

In [7]:
seconds_per_experiment = 1250

minutes_per_experiment = seconds_per_experiment / 60
print(minutes_per_experiment)

20.833333333333332


Useful resources:

http://acsweb.ucsd.edu/~dklim/mf_presentation.pdf

https://web.stanford.edu/~lmackey/papers/cf_slides-pml09.pdf

These resources indicate that solving SVD with missing values through Stochastic Gradient Descent (i.e. our implementation) should have a time complexity of O(NK) per epoch.

Notation: there are U users and M items, so we have a U x M rating matrix R.

R ≈ A × B^T
A is U x K (user factors), B is M x K (item factors)

N is the total number of observed ratings, each a (user, item, rating) triple. K is the number of latent factors.

For more detail: http://www.cs.utexas.edu/~inderjit/public_papers/kais-pmf.pdf
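
To see where the O(NK) per-epoch figure comes from, here is a minimal sketch of one SGD pass over the observed ratings. It is not our actual implementation: it omits bias terms, and the function names, learning rate, regularization value, and toy data are all placeholders chosen for illustration. The point is only that each of the N ratings triggers a constant number of length-K vector operations.

```python
import numpy as np

def sgd_epoch(ratings, A, B, lr=0.05, reg=0.02):
    """One SGD pass over the observed ratings, updating A and B in place.

    ratings: list of (user_index, item_index, rating) triples (N of them)
    A: U x K user-factor matrix, B: M x K item-factor matrix
    lr and reg are illustrative values, not tuned settings.
    """
    for u, i, r in ratings:                       # N iterations per epoch
        err = r - A[u] @ B[i]                     # O(K) dot product
        a_u = A[u].copy()                         # old user factors for the item update
        A[u] += lr * (err * B[i] - reg * A[u])    # O(K) update
        B[i] += lr * (err * a_u - reg * B[i])     # O(K) update

def rmse(ratings, A, B):
    """Root mean squared error on the given ratings."""
    return float(np.sqrt(np.mean([(r - A[u] @ B[i]) ** 2 for u, i, r in ratings])))

# Toy data (made up): 4 users, 3 items, 2 latent factors, 5 observed ratings.
rng = np.random.default_rng(0)
U, M, K = 4, 3, 2
A = rng.normal(scale=0.1, size=(U, K))
B = rng.normal(scale=0.1, size=(M, K))
toy_ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 2, 1.0), (3, 1, 2.0)]

rmse_before = rmse(toy_ratings, A, B)
for _ in range(200):
    sgd_epoch(toy_ratings, A, B)
rmse_after = rmse(toy_ratings, A, B)
print(rmse_before, rmse_after)
```

Note that the loop never touches the full U x M matrix, only the observed entries, which is why the cost tracks N rather than U times M.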


MovieLens 20M looks like this:

In [3]:
movielens_20m = {
    'num_users': 138493,
    'num_movies': 27278,
    'num_ratings': 20000263,
}

So let's see what the ratios look like:

In [5]:
ratios = {key: movielens_20m[key]/movielens_1m[key] for key in movielens_20m.keys()}
print(ratios)

{'num_users': 22.92930463576159, 'num_movies': 6.994358974358974, 'num_ratings': 19.99608381848194}


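For SVD, our matrix has U rows (one per user) and M columns (one per item).

However, the O(NK) per-epoch cost depends on the number of observed ratings N, not on the matrix dimensions. Therefore, when going from MovieLens 1M to MovieLens 20M, we'd expect runtime to increase in line with the number of ratings (about 20x), assuming K and the number of epochs stay the same.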
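
As a quick sanity check on which ratio matters, the sketch below compares how much the full matrix grows (users times movies) against how much the number of observed ratings grows. The helper names `matrix_cells` and `density` are made up for this example; the dataset stats are repeated so the cell runs on its own.

```python
movielens_1m = {'num_users': 6040, 'num_movies': 3900, 'num_ratings': 1000209}
movielens_20m = {'num_users': 138493, 'num_movies': 27278, 'num_ratings': 20000263}

def matrix_cells(stats):
    """Number of cells in the full user x movie matrix."""
    return stats['num_users'] * stats['num_movies']

def density(stats):
    """Fraction of matrix cells that hold an observed rating."""
    return stats['num_ratings'] / matrix_cells(stats)

cells_ratio = matrix_cells(movielens_20m) / matrix_cells(movielens_1m)
ratings_ratio = movielens_20m['num_ratings'] / movielens_1m['num_ratings']

print('matrix size ratio:', cells_ratio)
print('ratings ratio:', ratings_ratio)
print('density 1M:', density(movielens_1m))
print('density 20M:', density(movielens_20m))
```

The full matrix grows by roughly 160x while the ratings grow by only about 20x, so an estimate based on matrix size would be far more pessimistic than one based on the number of ratings.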

In [13]:
increase = ratios['num_ratings']

new_minutes_per_experiment = minutes_per_experiment * increase
new_hours_per_experiment = new_minutes_per_experiment / 60
new_days_per_experiment = new_hours_per_experiment / 24
print(new_hours_per_experiment, new_days_per_experiment)

6.943084659195118 0.28929519413312993
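
The same estimate can be wrapped in a small helper so other scenarios are easy to try. This is a rough sketch that assumes runtime is proportional to N times K times the number of epochs; `estimate_seconds`, `factor_ratio`, and `epoch_ratio` are hypothetical names introduced here, and fixed overheads (data loading, evaluation) are ignored.

```python
def estimate_seconds(base_seconds, base_ratings, new_ratings,
                     factor_ratio=1.0, epoch_ratio=1.0):
    """Scale a measured runtime, assuming cost is proportional to N * K * epochs.

    factor_ratio: new K divided by old K (1.0 means unchanged)
    epoch_ratio: new epoch count divided by old epoch count (1.0 means unchanged)
    """
    return base_seconds * (new_ratings / base_ratings) * factor_ratio * epoch_ratio

# Same inputs as above: 1250 seconds per experiment on MovieLens 1M.
same_settings = estimate_seconds(1250, 1000209, 20000263)
double_factors = estimate_seconds(1250, 1000209, 20000263, factor_ratio=2.0)

print(same_settings / 3600, 'hours with unchanged K and epochs')
print(double_factors / 3600, 'hours if K were doubled')
```

With unchanged settings this reproduces the roughly 7 hour figure above; doubling K would double it under the same assumption.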
