# Train SAR on MovieLens with Azure Machine Learning (Python, CPU)
---
## Introduction to Azure Machine Learning  
The **[Azure Machine Learning service (AzureML)](https://docs.microsoft.com/azure/machine-learning/service/overview-what-is-azure-ml)** provides a cloud-based environment you can use to prep data, train, test, deploy, manage, and track machine learning models. By using Azure Machine Learning service, you can start training on your local machine and then scale out to the cloud. With many available compute targets, like [Azure Machine Learning Compute](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute) and [Azure Databricks](https://docs.microsoft.com/en-us/azure/azure-databricks/what-is-azure-databricks), and with [advanced hyperparameter tuning services](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-tune-hyperparameters), you can build better models faster by using the power of the cloud.

Data scientists and AI developers use the main [Azure Machine Learning Python SDK](https://docs.microsoft.com/en-us/python/api/overview/azure/ml/intro?view=azure-ml-py) to build and run machine learning workflows with the Azure Machine Learning service. You can interact with the service in any Python environment, including Jupyter Notebooks or your favorite Python IDE. The Azure Machine Learning SDK allows you the choice of using local or cloud compute resources, while managing and maintaining the complete data science workflow from the cloud.
![AzureML Workflow](https://docs.microsoft.com/en-us/azure/machine-learning/service/media/overview-what-is-azure-ml/aml.png)

This notebook shows how to use and evaluate the Simple Algorithm for Recommendation (SAR) with the Azure Machine Learning service. It takes the content of the [SAR quickstart notebook](sar_movielens.ipynb) and demonstrates how to use the power of the cloud to manage data, switch to powerful GPU machines, and monitor runs while training a model.

See the hyperparameter tuning notebook for more advanced use cases with AzureML.

### Advantages of using AzureML:
- Manage cloud resources for monitoring, logging, and organizing your machine learning experiments.
- Train models either locally or by using cloud resources, including GPU-accelerated model training.
- Scale out easily as the dataset grows by creating and pointing to a new compute target.

---
## Details of SAR
<details>
    <summary>Click to expand</summary>
    
SAR is a fast, scalable, adaptive algorithm for personalized recommendations based on user transaction history. It produces easily explainable and interpretable recommendations and handles "cold item" and "semi-cold user" scenarios. SAR is a neighborhood-based algorithm (as discussed in [Recommender Systems by Aggarwal](https://dl.acm.org/citation.cfm?id=2931100)) intended for ranking the top items for each user.

SAR recommends items that are most ***similar*** to the ones that the user already has an existing ***affinity*** for. Two items are ***similar*** if the users who have interacted with one item are also likely to have interacted with another. A user has an ***affinity*** to an item if they have interacted with it in the past.

### Advantages of SAR:
- High accuracy for an algorithm that is easy to train and deploy
- Fast training, only requiring simple counting to construct the matrices used at prediction time
- Fast scoring, only involving multiplication of the similarity matrix with an affinity vector
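
The counting and multiplication steps above can be sketched in a few lines of plain Python. This is a toy illustration, not the Recommenders implementation: the interaction data, the function names, the binary affinity (1 for any past interaction), and the choice of Jaccard similarity are all assumptions made for the example.

```python
# Toy interaction data (hypothetical): user -> set of items they interacted with.
interactions = {
    "u1": {"A", "B"},
    "u2": {"A", "B", "C"},
    "u3": {"B", "C"},
    "u4": {"C", "D"},
}
items = sorted({item for seen in interactions.values() for item in seen})


def cooccurrence(interactions, items):
    """Count, for every item pair, how many users interacted with both."""
    counts = {i: {j: 0 for j in items} for i in items}
    for seen in interactions.values():
        for i in seen:
            for j in seen:
                counts[i][j] += 1  # the diagonal holds per-item user counts
    return counts


def jaccard(counts, items):
    """Rescale co-occurrence counts into Jaccard similarity."""
    sim = {i: {} for i in items}
    for i in items:
        for j in items:
            union = counts[i][i] + counts[j][j] - counts[i][j]
            sim[i][j] = counts[i][j] / union if union else 0.0
    return sim


def recommend(user, sim, interactions, items, k):
    """Score unseen items as affinity (binary here) times similarity; return the top k."""
    seen = interactions[user]
    scores = {j: sum(sim[i][j] for i in seen) for j in items if j not in seen}
    # Sort by descending score, breaking ties by item name.
    return sorted(scores, key=lambda j: (-scores[j], j))[:k]


counts = cooccurrence(interactions, items)
sim = jaccard(counts, items)
print(recommend("u1", sim, interactions, items, k=1))
```

In this toy data, user `u1` interacted with items A and B, and the highest-scoring unseen item is C because it co-occurs with both.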

### Notes on using SAR properly:
- SAR does not use item or user features, so it cannot handle cold-start use cases.
- SAR requires the creation of an $m \times m$ dense matrix (where $m$ is the number of items), so memory consumption can be an issue with large numbers of items.
- SAR is best used for ranking items per user, as the scale of predicted ratings may be different from the input range and will differ across users.
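
To get a feel for the memory note above, a rough estimate multiplies $m^2$ by the size of one matrix entry. The sketch below assumes 8-byte (float64) entries, which is an assumption for illustration rather than a statement about how SAR stores its matrix, and the helper name is hypothetical. The figure 1,682 is the number of movies in MovieLens 100k.

```python
def dense_matrix_bytes(num_items, bytes_per_value=8):
    """Bytes needed for a dense num_items x num_items matrix."""
    return num_items * num_items * bytes_per_value


small = dense_matrix_bytes(1_682)    # MovieLens 100k item count
large = dense_matrix_bytes(100_000)  # a hypothetical larger catalog

print(f"{small / 1e6:.1f} MB")  # about 22.6 MB
print(f"{large / 1e9:.1f} GB")  # 80.0 GB
```
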

For more details, see the [SAR Deep Dive Notebook](../02_model_collaborative_filtering/sar_deep_dive.ipynb).

</details>

---
## Prerequisites
   - **Azure Subscription**
     - If you don’t have an Azure subscription, create a free account before you begin. Try the [free or paid version of Azure Machine Learning service today](https://azure.microsoft.com/en-us/free/services/machine-learning/).
     - You get credits to spend on Azure services, which will easily cover the cost of running this example notebook. After they're used up, you can keep the account and use [free Azure services](https://azure.microsoft.com/en-us/free/). Your credit card is never charged unless you explicitly change your settings and ask to be charged. Or [activate MSDN subscriber benefits](https://azure.microsoft.com/en-us/pricing/member-offers/credit-for-visual-studio-subscribers/), which give you credits every month that you can use for paid Azure services.
---   

In [1]:
# import standard libraries and the MovieLens loader from Recommenders
import os
import shutil
import numpy as np
from tempfile import TemporaryDirectory
from recommenders.datasets import movielens

In [2]:
# top k items to recommend
TOP_K = 10

# Select MovieLens data size: 100k, 1m, 10m, or 20m
MOVIELENS_DATA_SIZE = '100k'

### Connect to an AzureML workspace

An [AzureML Workspace](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace.workspace?view=azure-ml-py) is an Azure resource that organizes and coordinates the actions of many other Azure resources to assist in executing and sharing machine learning workflows. In particular, an Azure ML Workspace coordinates storage, databases, and compute resources providing added functionality for machine learning experimentation, deployment, inferencing, and the monitoring of deployed models.

The cell below connects to an existing AzureML Workspace with `MLClient.from_config`, which reads the workspace details from a configuration file (here `aml_config/config.json`) and authenticates with `DefaultAzureCredential`.

The configuration file supplies the subscription ID, resource group, and workspace name. See [this tutorial](https://docs.microsoft.com/en-us/azure/machine-learning/service/setup-create-workspace#portal) to locate information such as the subscription ID.

The cell also retrieves the workspace's default blob datastore, `workspaceblobstore`.
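
The configuration file mentioned above is a small JSON document with three keys: `subscription_id`, `resource_group`, and `workspace_name`. As a minimal sketch, the snippet below writes such a file using only the standard library. The helper name and the placeholder values are hypothetical, and the file is written to a throwaway directory so nothing in your project is overwritten.

```python
import json
import os
from tempfile import TemporaryDirectory


def write_workspace_config(root_dir, subscription_id, resource_group, workspace_name):
    """Write aml_config/config.json under root_dir and return its path."""
    config_dir = os.path.join(root_dir, "aml_config")
    os.makedirs(config_dir, exist_ok=True)
    config_path = os.path.join(config_dir, "config.json")
    with open(config_path, "w") as f:
        json.dump(
            {
                "subscription_id": subscription_id,
                "resource_group": resource_group,
                "workspace_name": workspace_name,
            },
            f,
            indent=2,
        )
    return config_path


# Placeholder values only; replace them with your own workspace details.
with TemporaryDirectory() as demo_dir:
    path = write_workspace_config(
        demo_dir, "<subscription-id>", "<resource-group>", "<workspace-name>"
    )
    with open(path) as f:
        loaded = json.load(f)

print(sorted(loaded))
```
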

In [3]:
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Workspace, Data, AzureDataLakeGen2Datastore, Datastore
from azure.identity import DefaultAzureCredential
from azure.ai.ml.constants import AssetTypes

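# connect to the workspace described in the local config.json
ml_client = MLClient.from_config(DefaultAzureCredential())

# get the workspace's default blob datastore
datastore = ml_client.datastores.get(name='workspaceblobstore')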

Found the config file in: /mnt/batch/tasks/shared/LS_root/mounts/clusters/activate-training-02-vm/code/Users/pgabriel/SAR-Recommenders/aml_config/config.json


### Create a Temporary Directory
This directory will hold the data and scripts needed by the AzureML Workspace.

In [4]:
tmp_dir = TemporaryDirectory()
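
`TemporaryDirectory` creates the folder immediately and removes it, along with its contents, when `cleanup()` is called or when it is used as a context manager and the block exits. A minimal sketch of that lifecycle, using a hypothetical file name:

```python
import os
from tempfile import TemporaryDirectory

demo_dir = TemporaryDirectory()
demo_path = demo_dir.name  # absolute path of the new directory

# Files written under demo_path live only as long as the directory does.
with open(os.path.join(demo_path, "example.txt"), "w") as f:
    f.write("hello")

existed_before = os.path.exists(demo_path)
demo_dir.cleanup()  # delete the directory and everything in it
exists_after = os.path.exists(demo_path)

print(existed_before, exists_after)
```

In this notebook, that means `tmp_dir` should only be cleaned up once the files in it are no longer needed.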

### Download dataset and upload to datastore

Every workspace comes with a default [datastore](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data) (and you can register more) which is backed by the Azure blob storage account associated with the workspace. We can use it to transfer data from local to the cloud, and access it from the compute target.

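The cell below downloads the MovieLens data and writes it to a CSV file in the temporary directory. A later cell registers that file as a data asset in the workspace, which uploads it to the datastore.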

In [5]:
TARGET_DIR = 'movielens'

# download dataset
data = movielens.load_pandas_df(
    size=MOVIELENS_DATA_SIZE,
    header=['UserId','MovieId','Rating','Timestamp']
)

# write the dataset to a CSV file in the temporary directory
data_file_name = "movielens_" + MOVIELENS_DATA_SIZE + "_data.csv"
# data.to_pickle(os.path.join(tmp_dir.name, data_file_name))
data.to_csv(os.path.join(tmp_dir.name, data_file_name))


100%|██████████| 4.81k/4.81k [00:00<00:00, 35.9kKB/s]


In [6]:
# %%writefile $tmp_dir.name/MLTable
# type: mltable

# paths:
#   - pattern: ./*.csv
# transformations:
#   - read_delimited:
#       delimiter: ,
#       encoding: ascii
#       header: all_files_same_headers

In [7]:

ds = datastore
# ds.upload(src_dir=tmp_dir.name, target_path=TARGET_DIR, overwrite=True, show_progress=False)
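# local path of the CSV file written earlier
my_path = os.path.join(tmp_dir.name, data_file_name)

# describe the file as a versioned data asset
my_data = Data(
    path=my_path,
    type=AssetTypes.URI_FILE,
    description="Movielens size 100k",
    name="movielens100kdataset",
    version='0.0.3'
)

# register the asset in the workspace
ml_client.data.create_or_update(my_data)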


Data({'skip_validation': False, 'mltable_schema_url': None, 'referenced_uris': None, 'type': 'uri_file', 'is_anonymous': False, 'auto_increment_version': False, 'name': 'movielens100kdataset', 'description': 'Movielens size 100k', 'tags': {}, 'properties': {}, 'id': '/subscriptions/b068fa50-ccf9-4b66-88e6-659b8f777d02/resourceGroups/ActivateAzureML-RG/providers/Microsoft.MachineLearningServices/workspaces/ActivateAzureML-WS/data/movielens100kdataset/versions/0.0.3', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/activate-training-02-vm/code/Users/pgabriel/SAR-Recommenders/notebooks', 'creation_context': <azure.ai.ml._restclient.v2022_05_01.models._models_py3.SystemData object at 0x7f47716788e0>, 'serialize': <msrest.serialization.Serializer object at 0x7f473bb79c10>, 'version': '0.0.3', 'latest_version': None, 'path': 'azureml://subscriptions/b068fa50-ccf9-4b66-88e6-659b8f777d02/resourcegroups/ActivateAzureML-RG/workspaces/ActivateAzureML-W