To start, clone or download this repository and navigate to the project's root directory.
We are using the open source MIND: Microsoft News Dataset. After you agree to the terms, links become available to download the following files, which you then unzip. Note that you only need the large test set along with the two small datasets:
- Training Set (MINDsmall_train.zip)
- Validation Set (MINDsmall_dev.zip)
- Test Set (MINDlarge_test.zip)
Visit https://msnews.github.io/ to download the files above.
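If you prefer to unzip the archives with a script, the following is a minimal sketch. The archive names are taken from the list above; the `downloads` and `data` directory names and the `unzip_mind` helper are assumptions made for illustration and are not part of this repository.

```python
import zipfile
from pathlib import Path

# Archive names come from the dataset list above.
ARCHIVES = ["MINDsmall_train.zip", "MINDsmall_dev.zip", "MINDlarge_test.zip"]


def unzip_mind(download_dir, output_dir):
    """Extract each MIND archive into a folder named after the archive."""
    extracted = []
    for name in ARCHIVES:
        archive = Path(download_dir) / name
        if not archive.exists():
            # Missing archives are skipped so a partial download still works.
            print(f"Skipping {name}: not found in {download_dir}")
            continue
        target = Path(output_dir) / archive.stem  # e.g. data/MINDsmall_train
        target.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(target)
    return extracted


if __name__ == "__main__":
    # "downloads" and "data" are placeholder directories (assumption).
    for folder in unzip_mind("downloads", "data"):
        print(f"Extracted {folder}")
```

Each archive ends up in its own folder (`MINDsmall_train/`, `MINDsmall_dev/`, `MINDlarge_test/`), which matches the folder names used in the upload step.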
Start by deploying Azure Synapse and its related resources:
- This button links to the Azure custom deployment page where you can use the azuredeploy.json as your Azure Resource Manager (ARM) template.
- If you prefer to set up manually, you need to deploy Azure Synapse Analytics with a Spark pool set up in the workspace and access to an Azure Data Lake (Gen2) Storage Account.
In this step you will upload the MIND: Microsoft News Dataset files to the Azure Data Lake (Gen2) Storage.
You can upload the files with the Azure Storage Explorer application or with azcopy. The steps below use Azure Storage Explorer:
- Open the Microsoft Azure Storage Explorer application
- Connect to your Azure account
- In the Explorer, expand your subscription and find the storage account deployed in Step 1
- Expand "Blob containers" and click on the cms container
- Create a new folder named MicrosoftNewsDataset and double-click into it
- Drag & drop or click Upload > Upload Folder... for the following unzipped MIND folders:
  - MINDsmall_train/ (Training Set)
  - MINDsmall_dev/ (Validation Set)
  - MINDlarge_test/ (Test Set)
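For the azcopy route, the sketch below only builds the commands and prints them, so you can review them before running anything. The `cms` container and `MicrosoftNewsDataset` folder come from the steps above; the storage account name, the local `data` directory, the choice of the blob endpoint, and the `azcopy_commands` helper are assumptions. It also assumes you have already authenticated, for example with `azcopy login`.

```python
import shlex

# Folder names come from the upload steps above.
MIND_FOLDERS = ["MINDsmall_train", "MINDsmall_dev", "MINDlarge_test"]


def azcopy_commands(account_name, local_dir, container="cms",
                    remote_folder="MicrosoftNewsDataset"):
    """Build one `azcopy copy ... --recursive` command per unzipped folder."""
    destination = (
        f"https://{account_name}.blob.core.windows.net/{container}/{remote_folder}"
    )
    return [
        ["azcopy", "copy", f"{local_dir}/{folder}", destination, "--recursive"]
        for folder in MIND_FOLDERS
    ]


if __name__ == "__main__":
    # "<storage-account-name>" and "data" are placeholders (assumption).
    for cmd in azcopy_commands("<storage-account-name>", "data"):
        print(shlex.join(cmd))
```

Printing rather than executing keeps the sketch safe to run; once the output looks right, the commands can be pasted into a terminal or passed to `subprocess.run`.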
Before you can upload assets to the Synapse workspace, you will need to add your IP address:
- Go to the Synapse resource you created in the previous step.
- Navigate to Firewalls under Security on the left-hand side of the page.
- At the top of the screen, click + Add client IP
- Your IP address should now be visible in the IP list (optionally, add other users' IPs)
In order to perform the necessary actions in Synapse workspace, you will need to grant more access.
- Go to the Azure Data Lake Storage Account created above
- Go to Access Control (IAM) > + Add > Add role assignment
- Now click the Role dropdown and select Storage Blob Data Contributor
  - Search for your Synapse workspace name (e.g. recommend-synapse-workspace)
  - Also add your username and any other usernames to the search bar
- Click Save at the bottom
- Repeat steps 2-4 to add the Contributor role to the Synapse workspace as well
To enable other users to use this storage account after you create your workspace, perform these tasks:
- Assign other users to the Contributor role on the workspace
- Assign other users the appropriate Synapse RBAC roles using Synapse Studio
- Assign yourself and other users to the Storage Blob Data Contributor role on the storage account
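The storage role assignments above can also be scripted with the Azure CLI instead of the portal. The sketch below builds `az role assignment create` commands and prints them without running them. The subscription ID, resource group, storage account name, assignee values, and both helper functions are placeholders introduced for illustration; only the role name comes from the steps above.

```python
import shlex


def storage_scope(subscription_id, resource_group, storage_account):
    """Resource ID of the storage account, used as the role assignment scope."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{storage_account}"
    )


def role_assignment_commands(assignees, scope,
                             role="Storage Blob Data Contributor"):
    """Build one `az role assignment create` command per assignee."""
    return [
        ["az", "role", "assignment", "create",
         "--assignee", assignee, "--role", role, "--scope", scope]
        for assignee in assignees
    ]


if __name__ == "__main__":
    # All values below are placeholders (assumption).
    scope = storage_scope("<subscription-id>", "<resource-group>",
                          "<storage-account-name>")
    users = ["<your-user-principal-name>", "<workspace-identity-object-id>"]
    for cmd in role_assignment_commands(users, scope):
        print(shlex.join(cmd))
```

The Synapse RBAC roles mentioned above are assigned in Synapse Studio and are not covered by this sketch.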
- Launch the Synapse workspace (via Azure portal > Synapse workspace > Workspace web URL)
- Go to Develop, click the +, and click Import to select all Spark notebooks from the repository's /src/ folder
- For each of the notebooks, select Attach to > spark1 in the top dropdown
- Update the account_name variable to your ADLS in the 01-Load-Data.ipynb notebook
- Publish your new notebooks so they are saved in your workspace
- Run the following notebooks in order:
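
For the account_name update in the steps above, it can help to see how a storage account name typically turns into a Data Lake path. The sketch below uses the standard `abfss://` URI format with the `cms` container and `MicrosoftNewsDataset` folder from the upload step. The `adls_path` helper and the example file are assumptions; how 01-Load-Data.ipynb actually builds its paths is not shown here.

```python
def adls_path(account_name, relative_path, container="cms"):
    """Build an abfss:// URI for a file or folder in ADLS Gen2."""
    return (
        f"abfss://{container}@{account_name}.dfs.core.windows.net/"
        f"{relative_path}"
    )


if __name__ == "__main__":
    # Replace with the name of your own storage account (placeholder).
    account_name = "<your-adls-account-name>"
    # behaviors.tsv is one of the files in the MIND training set.
    print(adls_path(account_name,
                    "MicrosoftNewsDataset/MINDsmall_train/behaviors.tsv"))
```

If a notebook fails with a path or authorization error, checking that the printed account and container match your deployment is a reasonable first step.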
Visualize the personalized recommendations using a Power BI dashboard:
- Download Power BI Desktop
- Open the reports/ContentRecommendations.pbit file
- Cancel the Refresh pop-up since the data source needs to be updated
- Click Transform data > Data source settings > Change Source... from the top menu
- Update the Server field with your Serverless SQL endpoint, which can be found within Azure > Synapse workspace > Overview.
- Keep database as default and click OK
You have completed this solution accelerator and should now have a report to explore the personalized recommendations: