<a href="https://colab.research.google.com/github/martin-fabbri/colab-notebooks/blob/master/recsys/merlin/movielens_merlin_01_download_convert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MovieLens 25M Dataset
## 01. Download & Convert
MovieLens 25M is a stable benchmark dataset of movie ratings. It contains 25 million ratings and one million tag applications applied to 62,000 movies by 162,000 users. It also includes tag genome data with 15 million relevance scores across 1,129 tags. Released in December 2019.

ml-25m.zip (size: 250 MB)
Permalink: https://grouplens.org/datasets/movielens/25m/

Many projects use only the user/item/rating information of MovieLens, but the original dataset also provides metadata for the movies, for example the genres each movie belongs to. Although our neural network architecture in this example may not improve on state-of-the-art results, we will use the metadata to show how to multi-hot encode categorical features.
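
To make the term concrete, here is a minimal sketch of multi-hot encoding in plain Python. It is an illustration only, not how NVTabular encodes features internally; the toy data and the helper name `multi_hot_encode` are assumptions made for this example.

```python
# Hypothetical toy data: two movies with pipe-separated genre strings,
# in the same format as the genres column of movies.csv.
toy_movies = {
    1: "Adventure|Animation|Children",
    2: "Adventure|Children|Fantasy",
}


def multi_hot_encode(genre_lists):
    """Return (vocabulary, vectors) for a dict of movie id -> list of genres."""
    # The vocabulary is the sorted set of all genres seen in the data.
    vocabulary = sorted({g for genres in genre_lists.values() for g in genres})
    index = {genre: i for i, genre in enumerate(vocabulary)}
    vectors = {}
    for movie_id, genres in genre_lists.items():
        vector = [0] * len(vocabulary)
        for genre in genres:
            vector[index[genre]] = 1  # one slot per genre the movie has
        vectors[movie_id] = vector
    return vocabulary, vectors


genre_lists = {movie_id: s.split("|") for movie_id, s in toy_movies.items()}
vocabulary, vectors = multi_hot_encode(genre_lists)
print(vocabulary)
print(vectors)
```

Unlike one-hot encoding, a multi-hot vector can have several entries set to 1, one for each genre a movie belongs to.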

In [1]:
!nvidia-smi

Sat Apr  2 04:53:51 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   43C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
!pip install -qq nvtabular

[?25l[K     |                                | 10 kB 27.4 MB/s eta 0:00:01[K     |▏                               | 20 kB 11.1 MB/s eta 0:00:01[K     |▎                               | 30 kB 8.9 MB/s eta 0:00:01[K     |▍                               | 40 kB 8.1 MB/s eta 0:00:01[K     |▌                               | 51 kB 4.6 MB/s eta 0:00:01[K     |▋                               | 61 kB 5.4 MB/s eta 0:00:01[K     |▊                               | 71 kB 5.3 MB/s eta 0:00:01[K     |▉                               | 81 kB 5.3 MB/s eta 0:00:01[K     |█                               | 92 kB 5.9 MB/s eta 0:00:01[K     |█                               | 102 kB 5.2 MB/s eta 0:00:01[K     |█▏                              | 112 kB 5.2 MB/s eta 0:00:01[K     |█▎                              | 122 kB 5.2 MB/s eta 0:00:01[K     |█▍                              | 133 kB 5.2 MB/s eta 0:00:01[K     |█▌                              | 143 kB 5.2 MB/s eta 0:00:01[K   

## Download the Dataset

In [3]:
# External dependencies
import os

from merlin.core.utils import download_file

# Get dataframe library - cudf or pandas
from merlin.core.dispatch import get_lib
df_lib = get_lib()

We define our base input directory, which will contain the data.

In [7]:
!mkdir -p /content/movielens/
!mkdir -p /content/movielens/data/
INPUT_DATA_DIR = os.environ.get(
    "INPUT_DATA_DIR", os.path.expanduser("/content/movielens/data/")
)
INPUT_DATA_DIR

'/content/movielens/data/'
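
The same setup can be done without shell commands. The sketch below is one possible approach, assuming we want the directory created from Python; the helper name `resolve_data_dir` and the temporary default path are hypothetical and only there so the example runs anywhere.

```python
import os
import tempfile


def resolve_data_dir(default, env_var="INPUT_DATA_DIR"):
    """Return the directory named by env_var if it is set, else default.

    The directory is created if it does not exist yet.
    """
    data_dir = os.environ.get(env_var, default)
    os.makedirs(data_dir, exist_ok=True)  # no error if it already exists
    return data_dir


# Hypothetical usage with a throwaway default directory.
default_dir = os.path.join(tempfile.mkdtemp(), "movielens", "data")
data_dir = resolve_data_dir(default_dir)
print(data_dir)
```

Reading the location from an environment variable lets the notebook run unchanged outside Colab, where `/content` may not exist.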

We will download and unzip the data.

In [8]:
download_file(
    "http://files.grouplens.org/datasets/movielens/ml-25m.zip",
    os.path.join(INPUT_DATA_DIR, "ml-25m.zip"),
)

downloading ml-25m.zip: 262MB [00:03, 76.7MB/s]                           
unzipping files: 100%|██████████| 8/8 [00:07<00:00,  1.01files/s]
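
Before converting, it can help to confirm that the files used later in this notebook were actually extracted. The helper below is a small sketch for that check; `missing_files` is a hypothetical name, and the demo runs against a temporary directory rather than the real download.

```python
import os
import tempfile


def missing_files(directory, expected):
    """Return the names in `expected` that are not regular files in `directory`."""
    return [
        name
        for name in expected
        if not os.path.isfile(os.path.join(directory, name))
    ]


# Hypothetical demo: a temporary directory that only contains movies.csv.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "movies.csv"), "w").close()
missing = missing_files(demo_dir, ["movies.csv", "ratings.csv"])
print(missing)
```

In this notebook the call would look like `missing_files(os.path.join(INPUT_DATA_DIR, "ml-25m"), ["movies.csv", "ratings.csv"])`, and an empty list would mean both files are present.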


## Convert the Dataset
First, we take a look at the movie metadata.

In [9]:
movies = df_lib.read_csv(os.path.join(INPUT_DATA_DIR, "ml-25m", "movies.csv"))
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


We can see that genres is a multi-hot categorical feature: each movie has a different number of genres. Currently, genres is a single string with the values separated by `|`, so we split it into a list of strings. In addition, we drop the title.

In [None]:
movies["genres"] = movies["genres"].str.split("|")
movies = movies.drop("title", axis=1)
movies.head()

Unnamed: 0,movieId,genres
0,1,"[Adventure, Animation, Children, Comedy, Fantasy]"
1,2,"[Adventure, Children, Fantasy]"
2,3,"[Comedy, Romance]"
3,4,"[Comedy, Drama, Romance]"
4,5,[Comedy]
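
For readers without cuDF or pandas at hand, the same conversion can be sketched with the standard library. This is an illustration under the assumption that the file is plain comma-separated text; `convert_movies` is a hypothetical helper, and the sample rows are copied from the `movies.head()` output above.

```python
import csv
import io

# Three rows copied from the movies.head() output above.
SAMPLE_CSV = """movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
5,Father of the Bride Part II (1995),Comedy
"""


def convert_movies(csv_text):
    """Split the genres string into a list and drop the title column."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        records.append(
            {
                "movieId": int(row["movieId"]),
                "genres": row["genres"].split("|"),  # variable-length list
            }
        )
    return records


converted = convert_movies(SAMPLE_CSV)
for record in converted:
    print(record["movieId"], len(record["genres"]), record["genres"])
```

The list lengths differ per movie (5, 3, and 1 in this sample), which is exactly why genres has to be treated as a multi-hot feature.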


In [None]:
movies.to_parquet(os.path.join(INPUT_DATA_DIR, "movies_converted.parquet"))

In [None]:
ratings = df_lib.read_csv(os.path.join(INPUT_DATA_DIR, "ml-25m", "ratings.csv"))
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


In [None]:
ratings = ratings.drop("timestamp", axis=1)

# Shuffle the dataset
ratings = ratings.sample(len(ratings), replace=False)

# Split the shuffled ratings into training (80%) and validation (20%) sets
num_valid = int(len(ratings) * 0.2)

train = ratings[:-num_valid]
valid = ratings[-num_valid:]
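
The shuffle-then-slice logic in this cell can be sketched in plain Python as follows. This is a hedged illustration, not the notebook's method: the helper name `shuffle_split` and the fixed `seed` are assumptions added here to make the split reproducible, which the cell itself does not do.

```python
import random


def shuffle_split(rows, valid_fraction=0.2, seed=42):
    """Shuffle rows and return (train, valid) lists.

    The last `valid_fraction` of the shuffled rows becomes the validation set.
    """
    rng = random.Random(seed)  # local generator, so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    num_valid = int(len(shuffled) * valid_fraction)
    # Slice by a positive index: rows[:-0] would be empty, not the full list.
    split_at = len(shuffled) - num_valid
    return shuffled[:split_at], shuffled[split_at:]


train_rows, valid_rows = shuffle_split(range(10))
print(len(train_rows), len(valid_rows))
```

Slicing with `[:-num_valid]` as in the cell works for the 25 million ratings here, but it would return an empty training set if `num_valid` were 0, so the sketch computes the split index explicitly.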

In [None]:
train.to_parquet(os.path.join(INPUT_DATA_DIR, "train.parquet"))
valid.to_parquet(os.path.join(INPUT_DATA_DIR, "valid.parquet"))