# 01 - Dataset management

[MovieLens25M](https://grouplens.org/datasets/movielens/25m/) is a popular dataset for recommender systems and is widely used in academic publications. It contains 25 million ratings of 62,000 movies given by 162,000 users. Many projects use only the user/item/rating information, but the original dataset also provides movie metadata, such as the genres of each movie. Although the neural network architecture in this example may not improve on state-of-the-art results, we use the metadata to show how to multi-hot encode categorical features.
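As a rough illustration of what multi-hot encoding means for the genre metadata, the sketch below encodes a few pipe-separated genre strings (the format used in the `genres` column of `movies.csv`) in plain Python. The sample rows and the helper names `build_vocabulary` and `multi_hot` are made up for this example; this is not the encoding code used later in the series.

```python
# Illustrative sample rows (assumption): the real movies.csv stores genres
# as a pipe-separated string in its `genres` column.
sample_genres = [
    "Adventure|Animation|Children",
    "Comedy|Romance",
    "Comedy",
]


def build_vocabulary(genre_strings):
    """Map each distinct genre to a column index, in sorted order."""
    genres = sorted({g for s in genre_strings for g in s.split("|")})
    return {genre: index for index, genre in enumerate(genres)}


def multi_hot(genre_string, vocabulary):
    """Return a 0/1 vector with a 1 for every genre the movie has."""
    vector = [0] * len(vocabulary)
    for genre in genre_string.split("|"):
        vector[vocabulary[genre]] = 1
    return vector


vocabulary = build_vocabulary(sample_genres)
encoded = [multi_hot(s, vocabulary) for s in sample_genres]

for genres, vector in zip(sample_genres, encoded):
    print(genres, vector)
```

Unlike one-hot encoding, a single row can have several positions set to 1, which is what lets one movie belong to multiple genres at once.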

This notebook covers the following steps:
1. Copy the `movielens` dataset to Cloud Storage
2. Create Vertex AI Dataset resources

## Setup

In [None]:
import os
import cudf
import tensorflow.io as tf_io
from nvtabular.utils import download_file
from google.cloud import aiplatform as vertex_ai

In [None]:
PROJECT = 'merlin-on-gcp'
REGION = 'us-central1'
BUCKET = 'merlin-on-gcp'

WORKSPACE = f"gs://{BUCKET}/movielens25m"

MOVIES_DATASET_DISPLAY_NAME = 'movielens25m-movies'
RATINGS_DATASET_DISPLAY_NAME = 'movielens25m-ratings'

In [None]:
CLEAN_WORKSPACE = True

if CLEAN_WORKSPACE and tf_io.gfile.exists(WORKSPACE):
    print("Cleaning up the workspace...")
    tf_io.gfile.rmtree(WORKSPACE)

if not tf_io.gfile.exists(WORKSPACE):
    print("Creating a new workspace...")
    tf_io.gfile.mkdir(WORKSPACE)

print("Workspace is ready.")

## 1. Copy the Dataset to Cloud Storage

### Download the dataset

In [None]:
download_file(
    "http://files.grouplens.org/datasets/movielens/ml-25m.zip",
    "ml-25m.zip"
)

In [None]:
# download_file extracts the archive by default, so the zip file can be removed.
!rm ml-25m.zip
!ls ml-25m

### Display sample data

In [None]:
movies = cudf.read_csv("ml-25m/movies.csv")
movies.head()

In [None]:
ratings = cudf.read_csv("ml-25m/ratings.csv")
ratings.head()

### Upload CSV data files to Cloud Storage

In [None]:
MOVIES_GCS_LOCATION = os.path.join(WORKSPACE, "dataset", "movies.csv")
RATINGS_GCS_LOCATION = os.path.join(WORKSPACE, "dataset", "ratings.csv")

!gsutil cp ml-25m/movies.csv {MOVIES_GCS_LOCATION}
!gsutil cp ml-25m/ratings.csv {RATINGS_GCS_LOCATION}
!rm -r ml-25m
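The cell above builds `gs://` URIs with `os.path.join`, which yields forward slashes on Linux (the usual notebook environment) but would insert backslashes on Windows. As a minimal sketch of a platform-independent alternative, `posixpath.join` always joins with `/`. The helper name `gcs_join` and the bucket `example-bucket` are hypothetical and used only for illustration.

```python
import posixpath


def gcs_join(base_uri, *parts):
    """Join path segments onto a gs:// URI using forward slashes only."""
    return posixpath.join(base_uri, *parts)


# Hypothetical workspace URI, standing in for the notebook's WORKSPACE.
workspace = "gs://example-bucket/movielens25m"

movies_uri = gcs_join(workspace, "dataset", "movies.csv")
ratings_uri = gcs_join(workspace, "dataset", "ratings.csv")

print(movies_uri)
print(ratings_uri)
```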

## 2. Create Vertex AI Dataset Resources

In [None]:
vertex_ai.init(
    project=PROJECT,
    location=REGION,
    staging_bucket=BUCKET,
)

In [None]:
vertex_ai.TabularDataset.create(
    display_name=MOVIES_DATASET_DISPLAY_NAME, gcs_source=MOVIES_GCS_LOCATION)

In [None]:
vertex_ai.TabularDataset.create(
    display_name=RATINGS_DATASET_DISPLAY_NAME, gcs_source=RATINGS_GCS_LOCATION)

In [None]:
vertex_datasets = vertex_ai.TabularDataset.list()
for vertex_dataset in vertex_datasets:
    print("Dataset display name:", vertex_dataset.display_name)
    print("Dataset GCS location:", vertex_dataset.gca_resource.metadata['inputConfig']['gcsSource']['uri'])
    print()
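Vertex AI display names are not unique, so rerunning the create cells above can leave several datasets with the same display name in the listing. The sketch below shows one way to pick out the entries for a given display name. It is written against stand-in objects (`SimpleNamespace`) rather than real `TabularDataset` instances, on the assumption that each listed item exposes a `display_name` attribute as used in the loop above; the helper name `find_by_display_name` and the resource names are hypothetical.

```python
from types import SimpleNamespace


def find_by_display_name(datasets, display_name):
    """Return every dataset whose display name matches exactly."""
    return [d for d in datasets if d.display_name == display_name]


# Stand-ins (assumption) for the objects returned by TabularDataset.list().
listed = [
    SimpleNamespace(display_name="movielens25m-movies", resource_name="datasets/1"),
    SimpleNamespace(display_name="movielens25m-ratings", resource_name="datasets/2"),
    SimpleNamespace(display_name="movielens25m-movies", resource_name="datasets/3"),
]

matches = find_by_display_name(listed, "movielens25m-movies")
for match in matches:
    print(match.resource_name)
```

In the notebook, the same helper could be applied to `vertex_datasets` to check for duplicates before creating a dataset again.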