Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Search dataset using tag information

In this tutorial, you will learn how to search registered datasets by their tag information using a pandas DataFrame. This example uses the NYC Taxi data. To prepare the datasets used in this notebook, complete [prep_dataset.ipynb](./prep_dataset.ipynb) first.

This tutorial includes the following tasks:
* Configure Azure ML workspace
* Load dataset and store dataset using 'easydict'
* Create tag filter method
* Search tag using predefined method 

## Prerequisite

* Please complete the [prep_dataset.ipynb](./prep_dataset.ipynb) first

## Configure Azure ML workspace


In [None]:
# Load required Python packages
import os
from datetime import datetime

import numpy as np
import pandas as pd
from easydict import EasyDict as edict

from azureml.core import Dataset, Experiment, Model, Run, Workspace
from azureml.data import OutputFileDatasetConfig

In [None]:
# Set up workspace info
subscription_id = '<your_subscription_id>'
resource_group = '<your_resource_group>'
workspace_name = '<your_workspace_name>'

ws = Workspace(subscription_id, resource_group, workspace_name)

In [None]:
# List all datasets registered in the current workspace
ws.datasets

## Load dataset and store dataset using 'easydict'

In [None]:
# Store the workspace's datasets in an EasyDict
ed_datasets = edict(ws.datasets)

# Show the dataset names
datasets_list = list(ed_datasets.keys())
datasets_list

In [None]:
# Collect the tags of every version of every dataset into a pandas DataFrame
ds_list = []
for dataset_name in datasets_list:
    latest = Dataset.get_by_name(ws, dataset_name)
    # Walk through all registered versions, from 1 up to the latest
    for version in range(1, latest.version + 1):
        vds = Dataset.get_by_name(ws, dataset_name, version=version)
        # Copy the tags so the dataset object's own dict is not modified
        ds_dict = dict(vds.tags)
        ds_dict["dataset_id"] = vds.id
        ds_dict["dataset_name"] = vds.name
        ds_dict["dataset_version"] = vds.version
        ds_list.append(ds_dict)
df_dataset = pd.DataFrame(ds_list)

In [None]:
df_dataset
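
Each tag key becomes a column of `df_dataset`, and a dataset version that does not carry a given tag gets `NaN` in that column. The following is a minimal sketch of that behavior using made-up records; the name `toy_a` and the tag values are hypothetical and are not read from the workspace.

```python
import pandas as pd

# Hypothetical records shaped like the dictionaries collected above:
# tag key/value pairs plus the dataset_name and dataset_version fields.
records = [
    {"version": "original", "type": "yellow",
     "dataset_name": "toy_a", "dataset_version": 1},
    # This version has no "type" tag at all.
    {"version": "cleaned",
     "dataset_name": "toy_a", "dataset_version": 2},
]

toy_df = pd.DataFrame(records)

# True where a row is missing the "type" tag
missing_type = toy_df["type"].isna().tolist()
print(toy_df)
print(missing_type)
```

Rows with a missing tag never match an equality filter on that tag, so they drop out of any search that names it.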

## Create tag filter method

Use this function to search for datasets whose tags match specific values. When several tags are given, a dataset must match all of them.

In [None]:
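# Build a pandas query string from the tags and show the matching datasets
def filter_dataset_using_tags(**taglist):
    # Step 1. Create one equality condition per tag.
    # Backticks allow tag names that are not plain identifiers (for example,
    # names containing spaces), and !r quotes the value safely.
    filter_condition_list = []
    for k, v in taglist.items():
        condition = f'(`{k}` == {v!r})'
        filter_condition_list.append(condition)
    # Step 2. Join the conditions so that all of them must hold
    condition = ' & '.join(filter_condition_list)
    # Step 3. Show the query result
    display(df_dataset.query(condition))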
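
The same all-tags-must-match search can also be written with boolean masks instead of a query string, which avoids quoting issues and returns the result rather than displaying it. This is a sketch under stated assumptions, not part of the original notebook: `filter_by_tags` is a hypothetical helper, and the toy DataFrame stands in for `df_dataset`.

```python
import pandas as pd

def filter_by_tags(df, **tags):
    """Return the rows of df whose tag columns equal every given value."""
    # Start with every row selected, then narrow down tag by tag.
    mask = pd.Series(True, index=df.index)
    for key, value in tags.items():
        if key not in df.columns:
            # No dataset carries this tag, so nothing can match.
            return df.iloc[0:0]
        # Rows where the tag is missing (NaN) compare as False.
        mask &= df[key] == value
    return df[mask]

# Hypothetical stand-in for df_dataset
toy = pd.DataFrame([
    {"dataset_name": "a", "version": "original", "type": "yellow"},
    {"dataset_name": "b", "version": "original", "type": "green"},
    {"dataset_name": "c", "version": "cleaned"},
])

result = filter_by_tags(toy, version="original", type="yellow")
print(result)
```

Returning the DataFrame lets the caller chain further operations on the result; in a notebook, leaving the call as the last expression of a cell still renders it.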

## Search tag using predefined method

In [None]:
# Search datasets by different tag conditions.
# Case 1 - datasets whose 'version' tag is 'original'
taglist = {'version':'original'}
filter_dataset_using_tags(**taglist)

In [None]:
# Case 2 - datasets whose 'version' tag is 'original' and whose 'type' tag is 'yellow'
taglist = {'version':'original', 'type':'yellow'}
filter_dataset_using_tags(**taglist)